- Added Turkish language support.
- Added Tamil language support.
- Added Italian language support.
- Added Afrikaans language support.
- Improved documentation, especially with regards to adding additional languages.
Keywords describe the main topics expressed in a document/text. Keyword *extraction* in turn allows for the extraction of important words and phrases from text. This in turn can be used for building a list of tags or to build a keyword search index or grouping similar content by its topics and much more. This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text.
+Keywords describe the main topics expressed in a document/text. Keyword *extraction* in turn allows for the extraction of important words and phrases from text.
 
-This project is based on another project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík, which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).
+Extracted keywords can be used for things like:
+- Building a list of useful tags out of a larger text
+- Building search indexes and search engines
+- Grouping similar content by its topic.
 
-*As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
+Extracted phrases can be used for things like:
+- Highlighting important areas of a larger text
+- Language or documentation analysis
+- Building intelligent searches based on contextual terms
+
+This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text
+and is based on another smaller and unmaintained project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík,
+which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).
+
+> *As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
 [Automatic Keyword Extraction from Individual Documents](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents).
 In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.*
 
-
 This particular package intends to include the following benefits over the original [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) package:
 * [orthosie](https://github.com/orthosie): Tamil language.
+* [ScIEnzY](https://github.com/ScIEnzY): Italian language.
 
 ## Installation
 
@@ -423,54 +444,95 @@ Array
 
 ## How to add additional languages
 
-**Using the stopwords extractor tool**
+The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc.
 
-The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc. An example list of such stopwords can be found [here (en_US)](http://www.lextek.com/manuals/onix/stopwords2.html). You can also [take a look at this list](https://github.com/Donatello-za/stopwords-json) which has stopwords for 50 different languages in individual JSON files.
+There are [stopwords for 50 languages](https://github.com/Donatello-za/stopwords-json#languages) (including the ones already supported) available in JSON format.
+If you are lucky enough to have your language listed then you can easily import it into the library. To
+do so, read the section below:
 
-When working with a simple list such as in the first example, you can copy and paste the text into a text file and use the extractor tool to convert it into a format that this library can read efficiently. *An example of such a stopwords file that has been copied from the hyperlink above has been included for your convenience (console/stopwords_en_US.txt)*
+**Using the stopwords extractor tool**
 
-Alternatively you can extract the stopwords from a JSON file, of which an example has also been supplied; look under `console/stopwords_en_US.json`
+> Note: These instructions assume you are using Linux
 
-**Note:** Simply replace `en_US` with whatever locale you wish to use in the examples below.
+We will be using the Greek language as an example:
 
-**Important:** Before using the `extractor` tool, make sure to use the following Linux command to check whether your locale is supported:
+1. Check whether your operating system has the Greek localisation files. The Greek locale
+code to look for is `el_GR`, so run the command `$ locale -a` to see if it is listed.
+2. If it is not listed, you'll need to create it, so run:
 
 ```sh
-$ locale -a
+sudo locale-gen el_GR
+sudo locale-gen el_GR.utf8
 ```
 
-If you do not see the locale you wish to use in the list, you can install it as follows (in this case we are installing the French locale):
+3. Go to the [list of stopword files](https://github.com/Donatello-za/stopwords-json#languages) and
+find the Greek language; the file will be called `el.json` and it will contain 75 stopwords.
+4. Download the `el.json` file and store it somewhere on your system.
+5. In your terminal, go to the directory of the `rake-php-plus` library; it will
+be under `vendor/donatello-za/rake-php-plus` if you used Composer to install it.
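
Steps 1 and 2 above can be combined into a small check before generating anything. The sketch below is only an illustration: `has_locale` is a hypothetical helper name (not part of the library), and it assumes that `locale -a` prints one locale per line, optionally with an encoding suffix such as `.utf8`.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the locale named in $1 appears in a
# `locale -a` style listing read from stdin.
has_locale() {
    # Match the bare name (el_GR) or the name plus an encoding suffix
    # (el_GR.utf8, el_GR.UTF-8), ignoring case.
    grep -qiE "^$1(\.[A-Za-z0-9-]+)?$"
}

if locale -a | has_locale el_GR; then
    echo "el_GR is available"
else
    echo "el_GR is missing; generate it with locale-gen as in step 2"
fi
```

Reading the listing from stdin keeps the helper easy to try against a saved copy of the `locale -a` output.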
+
+We now need to use the JSON file to create two new files: one will be a `.php` file
+that contains the stopwords as a PHP array and one will be a `.pattern` file, which
+is a text file containing the stopwords as a regular expression:
+
+1. Extract and convert the `.json` file to a PHP file by running:
It will output the results to the terminal. You will notice that the results look like PHP, and in fact they are. You can write the results directly to a PHP file by piping it:
To improve the initial loading speed of the language file within RakePlus, you can also set the exporter to produce the results as a regular expression pattern using the `--output` argument:
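
To illustrate what "stopwords as a regular expression" means, the sketch below joins a plain list of stopwords (one per line) into a single alternation pattern. It is not the library's exporter and makes no claim about the exact contents of a real `.pattern` file; `words_to_pattern` is a hypothetical helper and the `\b(...)\b` wrapping is an assumption chosen for illustration.

```sh
#!/bin/sh
# Hypothetical helper: read stopwords (one per line) from stdin and print
# one alternation pattern of the form \b(word1|word2|...)\b.
words_to_pattern() {
    # 1. Drop empty lines.
    # 2. Escape characters that have a special meaning in a regex.
    # 3. Join the remaining lines with "|".
    body=$(grep -v '^[[:space:]]*$' \
        | sed 's/[][\\.*+?(){}^$|]/\\&/g' \
        | paste -sd '|' -)
    printf '\\b(%s)\\b\n' "$body"
}

printf 'and\nare\nor\n' | words_to_pattern
```

A single precompiled pattern like this lets a program split text on stopwords in one pass instead of looping over an array of words, which is the kind of loading-speed gain described above.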