Commit 5522006

v1.0.17 Release
- Added Turkish language support.
- Added Tamil language support.
- Added Italian language support.
- Added Afrikaans language support.
- Improved documentation, especially with regards to adding additional languages.
1 parent 567b7e9 commit 5522006

3 files changed

+164
-38
lines changed

README.md

+100-38
@@ -1,47 +1,65 @@
11
# rake-php-plus
2-
Yet another PHP implementation of the Rapid Automatic Keyword Extraction algorithm (RAKE).
2+
A keyword and phrase extraction library based on the Rapid Automatic Keyword Extraction algorithm (RAKE).
33

44
[![Latest Stable Version](https://poser.pugx.org/donatello-za/rake-php-plus/v/stable)](https://packagist.org/packages/donatello-za/rake-php-plus)
55
[![Total Downloads](https://poser.pugx.org/donatello-za/rake-php-plus/downloads)](https://packagist.org/packages/donatello-za/rake-php-plus)
66
[![License](https://poser.pugx.org/donatello-za/rake-php-plus/license)](https://packagist.org/packages/donatello-za/rake-php-plus)
77

8-
## Why is this package useful?
8+
## Introduction
99

10-
Keywords describe the main topics expressed in a document/text. Keyword *extraction* in turn allows for the extraction of important words and phrases from text. This in turn can be used for building a list of tags or to build a keyword search index or grouping similar content by its topics and much more. This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text.
10+
Keywords describe the main topics expressed in a document/text. Keyword *extraction* in turn allows for the extraction of important words and phrases from text.
1111

12-
This project is based on another project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík, which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).
12+
Extracted keywords can be used for things like:
13+
- Building a list of useful tags out of a larger text
14+
- Building search indexes and search engines
15+
- Grouping similar content by its topic.
1316

14-
*As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
17+
Extracted phrases can be used for things like:
18+
- Highlighting important areas of a larger text
19+
- Language or documentation analysis
20+
- Building intelligent searches based on contextual terms
21+
22+
This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text
23+
and is based on another smaller and unmaintained project called [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) by Richard Filipčík,
24+
which is a translation from a Python implementation simply called [RAKE](https://github.com/aneesha/RAKE).
25+
26+
> *As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
1527
[Automatic Keyword Extraction from Individual Documents](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents).
1628
In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.*
1729

18-
1930
This particular package intends to include the following benefits over the original [RAKE-PHP](https://github.com/Richdark/RAKE-PHP) package:
2031

21-
1. Add [PSR-2](http://www.php-fig.org/psr/psr-2/) coding standards.
22-
2. Implement [PSR-4](http://www.php-fig.org/psr/psr-4/) in order to be [Composer](https://getcomposer.org) installable.
23-
3. Add additional functionality such as method chaining.
24-
4. Add multiple ways to provide source stopwords.
32+
1. [PSR-2](http://www.php-fig.org/psr/psr-2/) coding standards.
33+
2. [PSR-4](http://www.php-fig.org/psr/psr-4/) to be [Composer](https://getcomposer.org) installable.
34+
3. Additional functionality such as method chaining.
35+
4. Multiple ways to provide source stopwords.
2536
5. Full unit test coverage.
2637
6. Performance improvements.
2738
7. Improved documentation.
39+
8. Easy language integration and multibyte string support.
2840

2941
## Currently Supported Languages
3042

43+
* Arabic (United Arab Emirates)/لإمارات العربية المتحدة (ar_AE)
44+
* Brazilian Portuguese/português do Brasil (pt_BR)
3145
* English US (en_US)
32-
* Spanish/español (es_AR)
46+
* European Portuguese/português europeu (pt_PT)
3347
* French/le français (fr_FR)
48+
* German (Germany)/Deutsch (Deutschland) (de_DE)
49+
* Italian (Italiano)
3450
* Polish/język polski (pl_PL)
3551
* Russian/русский язык (ru_RU)
36-
* Brazilian Portuguese/português do Brasil (pt_BR)
37-
* European Portuguese/português europeu (pt_PT)
3852
* Sorani Kurdish/سۆرانی (ckb_IQ)
39-
* Arabic (United Arab Emirates)/لإمارات العربية المتحدة (ar_AE)
40-
* German (Germany)/Deutsch (Deutschland) (de_DE)
53+
* Spanish/español (es_AR)
54+
* Tamil (தமிழ்)
55+
* Turkish (Türkçe)
56+
57+
> If your language is not listed here it can be added; please see the section
58+
called **How to add additional languages** at the bottom of the page.
4159

4260
## Version
4361

44-
v1.0.16
62+
v1.0.17
4563

4664
## Special Thanks
4765

@@ -51,6 +69,9 @@ v1.0.16
5169
* [Khoshbin Ali Ahmed](https://github.com/Xoshbin): Sorani Kurdish and Arabic languages.
5270
* [RhaPT](https://github.com/RhaPT): European Portuguese language.
5371
* [Peter Thaleikis](https://github.com/spekulatius): German language.
72+
* [Yusuf Usta](https://github.com/yusufusta): Turkish language.
73+
* [orthosie](https://github.com/orthosie): Tamil language.
74+
* [ScIEnzY](https://github.com/ScIEnzY): Italian language.
5475

5576
## Installation
5677

@@ -423,54 +444,95 @@ Array
423444

424445
## How to add additional languages
425446

426-
**Using the stopwords extractor tool**
447+
The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc.
427448

428-
The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc. An example list of such stopwords can be found [here (en_US)](http://www.lextek.com/manuals/onix/stopwords2.html). You can also [take a look at this list](https://github.com/Donatello-za/stopwords-json) which have stopwords for 50 different languages in individual JSON files.
449+
There are [stopwords for 50 languages](https://github.com/Donatello-za/stopwords-json#languages) (including the ones already supported) available in JSON format.
450+
If you are lucky enough to have your language listed then you can easily import it into the library. To
451+
do so, read the section below:
429452

430-
When working with a simple list such as in the first example, you can copy and paste the text into a text file and use the extractor tool to convert it into a format that this library can read efficiently. *An example of such a stopwords file that have been copied from the hyperlink above have been included for your convenience (console/stopwords_en_US.txt)*
453+
**Using the stopwords extractor tool**
431454

432-
Alternatively you can extract the stopwords from a JSON file of which an example have also been supplied, look under `console/stopwords_en_US.json`
455+
> Note: These instructions assume you are using Linux.
433456
434-
**Note:** Simply replace `en_US` to whatever locale you wish to use in the examples below.
457+
We will be using the Greek language as an example:
435458

436-
**Important:** Before using the `extractor` tool, make sure to use the following Linux command to check whether your locale is supported:
459+
1. Check whether your operating system has the Greek localisation files. The Greek locale
460+
code to look for is `el_GR`, so run the command `$ locale -a` to see if it is listed.
461+
2. If it is not listed, you'll need to create it, so run:
437462

438463
```sh
439-
$ locale -a
464+
sudo locale-gen el_GR
465+
sudo locale-gen el_GR.utf8
440466
```
441467

442-
If you do not see the locale you wish to use in the list you can install it as follows: (in this case we are installing the French locale):
468+
3. Go to the [list of stopword files](https://github.com/Donatello-za/stopwords-json#languages) and
469+
find the Greek language. The file will be called `el.json` and it will contain 75 stopwords.
470+
4. Download the `el.json` file and store it somewhere on your system.
471+
5. In your terminal, go to the directory of the `rake-php-plus` library; it will
472+
be under `vendor/donatello-za/rake-php-plus` if you used Composer to install it.
473+
474+
We now need to use the JSON file to create two new files, one will be a `.php` file
475+
that contains the stopwords as a PHP array and one will be a `.pattern` file which
476+
is a text file containing the stopwords as a regular expression:
477+
478+
1. Extract and convert the .json file to a PHP file by running:
443479

444480
```sh
445-
$ sudo locale-gen fr_FR
446-
$ sudo locale-gen fr_FR.utf8
481+
$ php ./console/extractor.php path/to/el.json --locale=el_GR --output=php > ./some/dir/el_GR.php
447482
```
448483

449-
To extract stopwords from a text file, run the following from the command line:
484+
2. Extract and convert the .json file to a .pattern file by running:
450485

451486
```sh
452-
$ cd ./console
453-
$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php
487+
$ php ./console/extractor.php path/to/el.json --locale=el_GR --output=pattern > ./some/dir/el_GR.pattern
488+
```
489+
490+
That is it! You can now use the new stopwords by specifying it when creating an instance
491+
of the RakePlus class, for example:
492+
493+
```php
494+
$rake = RakePlus::create($text, '/some/dir/el_GR.pattern');
454495
```
455496

456-
To extract stopwords from a JSON file, run the following from the command line:
497+
or
457498

458-
`$ php extractor.php stopwords_en_US.json --locale=en_US --output=php`
499+
```php
500+
$rake = RakePlus::create($text, '/some/dir/el_GR.php');
501+
```
459502

460-
It will output the results to the terminal. You will notice that the results looks like PHP and in fact it is. You can write the results directly to a PHP file by piping it:
503+
**Contribute by Adding a Language**
461504

462-
`$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php > en_US.php`
505+
If you want your language to be officially supported, you can fork this library,
506+
generate the `.pattern` and `.php` stopword files as described above, place them
507+
in the `./rake-php-plus/lang/` directory and submit them as a pull request.
463508

464-
Finally, copy the `en_US.php` file to the `lang/` directory and then instantiate php-rake-plus like so:
509+
Once your language is officially supported, you'll be able to specify the language
510+
without having to specify the file to use, for example:
465511

466512
```php
467-
$rake = RakePlus::create($text, 'en_US');
513+
$rake = RakePlus::create($text, 'el_GR');
514+
```
515+
516+
RakePHP will always look for a `.pattern` file first and if not found it will
517+
look for a `.php` file in the `./lang/` directory.
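
The lookup order described above can be sketched roughly as follows. This is only an illustration: `resolveStopwordsFile` is a hypothetical helper written for this example and is not part of the library's API, and the library's real lookup code may differ.

```php
<?php

/**
 * Hypothetical helper (not part of rake-php-plus): returns the path of the
 * stopwords file for a locale, preferring `.pattern` over `.php`.
 */
function resolveStopwordsFile(string $langDir, string $locale): ?string
{
    // The order of this list encodes the preference: .pattern first, then .php.
    foreach (['.pattern', '.php'] as $extension) {
        $candidate = rtrim($langDir, '/') . '/' . $locale . $extension;
        if (is_file($candidate)) {
            return $candidate;
        }
    }

    // Neither file exists for this locale.
    return null;
}

// Example call; the result depends on which files exist in ./lang/.
$file = resolveStopwordsFile('./lang', 'el_GR');
```

Returning `null` rather than throwing is a choice made for the sketch only; it leaves the decision about how to report a missing language to the caller.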
518+
519+
**I don't have a stopwords file for my language, what now?**
520+
521+
If your language is not covered in the [list of 50 languages here](https://github.com/Donatello-za/stopwords-json#languages)
522+
you may have to find it elsewhere; try searching for "yourlanguage stopwords". If you
523+
find a list or decide to create your own list, you can also just place it in a standard text
524+
file instead of a .json file and extract the stopwords using the extractor tool, for
525+
example:
526+
527+
```sh
528+
$ php ./console/extractor.php path/to/mystopwords.txt --locale=LOCAL_CODE --output=php > ./some/dir/LOCAL_CODE.php
529+
$ php ./console/extractor.php path/to/mystopwords.txt --locale=LOCAL_CODE --output=pattern > ./some/dir/LOCAL_CODE.pattern
468530
```
469-
To improve the initial loading speed of the language file within RakePlus, you can also set the exporter to produce the results as a regular expression pattern using the `--output` argument:
470531

471-
`$ php extractor.php stopwords_en_US.txt --locale=en_US --output=pattern > en_US.pattern`
532+
*Remember to replace `LOCAL_CODE` with the locale you wish to use.*
472533

473-
RakePHP will always look for a `.pattern` file first and if not found it will look for a `.php` file in the `./lang/` directory.
534+
Here is an example text file containing stopwords that was copied and pasted from a
535+
site: [stopwords_en_US](./console/stopwords_en_US.txt)
474536

475537
## To run tests
476538

lang/af_ZA.pattern

+1
@@ -0,0 +1 @@
1+
/\bwat\b|\bwas\b|\bvir\b|\bvan\b|\buit\b|\btoe\b|\bte\b|\bsy\b|\bso\b|\bsien\b|\bse\b|\bsal\b|\bsaam\b|\bop\b|\bons\b|\bom\b|\bnie\b|\bna\b|\bʼn(?!(-|'))\b|\b'n\b|\bmy\b|\bmet\b|\bmaar\b|\bma\b|\bkom\b|\bkan\b|\bjy\b|\bjou\b|\bis\b|\bin\b|\bhy\b|\bhulle\b|\bhom\b|\bhet\b|\bhaar\b|\bgesê\b|\bgaan\b|\ben\b|\bek\b|\been\b|\bdit\b|\bdie\b|\bdat\b|\bdag\b|\bdaar\b|\bby\b|\bbaie\b|\bas\b|\bal\b|\baf\b|\baan\b/i
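
To show what a pattern of this shape is for, here is a minimal sketch that builds a similar alternation from a short stopword list and uses it to split text into candidate phrases, which is the core idea behind RAKE. Both function names are hypothetical and not part of the library, the `u` modifier is an assumption added so that non-ASCII words are handled as UTF-8, and the special case for `ʼn` in the real file is not reproduced. A full RAKE implementation also splits on punctuation first, which this sketch skips.

```php
<?php

/**
 * Hypothetical helper: joins stopwords into one regular expression of the
 * form /\bword1\b|\bword2\b/ with case-insensitive matching.
 */
function buildStopwordPattern(array $stopwords): string
{
    $escaped = array_map(
        static function (string $word): string {
            // Escape regex metacharacters and the "/" delimiter.
            return preg_quote($word, '/');
        },
        $stopwords
    );

    return '/\b' . implode('\b|\b', $escaped) . '\b/iu';
}

/**
 * Hypothetical helper: splits text at every stopword and keeps the
 * non-empty pieces in between as candidate phrases.
 */
function splitIntoCandidatePhrases(string $text, string $pattern): array
{
    $phrases = [];
    foreach (preg_split($pattern, $text) as $part) {
        $part = trim($part);
        if ($part !== '') {
            $phrases[] = $part;
        }
    }

    return $phrases;
}

$pattern = buildStopwordPattern(['die', 'is', 'in', 'en', 'van']);
$phrases = splitIntoCandidatePhrases('Die kat is in die tuin', $pattern);
// $phrases is ['kat', 'tuin']
```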

lang/af_ZA.php

+63
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
<?php
2+
3+
/**
4+
* Stopwords list for the use in the PHP package rake-php-plus.
5+
* See: https://github.com/Donatello-za/rake-php-plus
6+
*
7+
* Extracted using extractor.php @ 2021-06-21T12:26:39+00:00
8+
*/
9+
10+
return [
11+
'wat',
12+
'was',
13+
'vir',
14+
'van',
15+
'uit',
16+
'toe',
17+
'te',
18+
'sy',
19+
'so',
20+
'sien',
21+
'se',
22+
'sal',
23+
'saam',
24+
'op',
25+
'ons',
26+
'om',
27+
'nie',
28+
'na',
29+
'ʼn',
30+
'\'n',
31+
'my',
32+
'met',
33+
'maar',
34+
'ma',
35+
'kom',
36+
'kan',
37+
'jy',
38+
'jou',
39+
'is',
40+
'in',
41+
'hy',
42+
'hulle',
43+
'hom',
44+
'het',
45+
'haar',
46+
'gesê',
47+
'gaan',
48+
'en',
49+
'ek',
50+
'een',
51+
'dit',
52+
'die',
53+
'dat',
54+
'dag',
55+
'daar',
56+
'by',
57+
'baie',
58+
'as',
59+
'al',
60+
'af',
61+
'aan'
62+
];
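
A stopwords file of this shape is a plain PHP script that returns an array, so it can be read with `require`. The sketch below writes a tiny stand-in file to the system temp directory and loads it. The file name and its three entries are made up for the demonstration, and this is not necessarily how the library itself loads the file.

```php
<?php

// Hypothetical demo file; a real one would live at lang/af_ZA.php.
$file = sys_get_temp_dir() . '/demo_stopwords_' . uniqid() . '.php';

// Same layout as the generated file: an opening tag and a returned array.
file_put_contents(
    $file,
    "<?php\n\nreturn [\n    'wat',\n    'was',\n    '\\'n',\n];\n"
);

// `require` evaluates the file and hands back whatever it returns.
$stopwords = require $file;

// Remove the temporary file again.
unlink($file);
```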
63+
