Conversation
Add file `words_alpha_clean.txt`, a copy of `words_alpha.txt` from which the words that do not exist in English have been removed.

The filtering was done with the [wordsapi](https://www.wordsapi.com/) `API`, which allows looking up English words. From a script, I called the API for each word, and whenever a word did not exist I removed it from the file. You can find the API docs [here](https://www.wordsapi.com/docs/).

The exact filter for a word is based on the `frequency` data of the API:

```javascript
if (!!response.word && typeof response.frequency == "object") {
  if (response.frequency.perMillion >= 15) {
    // here the word is kept
    realWords.push(response.word);
  }
  // else the word is removed
}
```

The documentation gives the following text for [frequency](https://www.wordsapi.com/docs/#frequency) data:

> This is the number of times the word is likely to appear in any English corpus, per million words.
|
Nice work but from 350000+ lines only around 2500 survived? Seems like the parameters used have been a little too strict... |
|
I used a less strict filter with |
|
|
The API is free for 2500 words per day. That is probably why.... |
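To put the quota in perspective, here is a rough back-of-the-envelope sketch. It assumes one API request per word and rounds the list down to 350000 entries (the thread only says "350000+"); `days_needed` is a hypothetical helper name, not part of any API.

```python
import math


def days_needed(total_words: int, daily_quota: int) -> int:
    """Days required to look up every word once under a fixed daily request quota."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    return math.ceil(total_words / daily_quota)


# Assumption: one request per word, 350000 words, 2500 free requests per day.
print(days_needed(350_000, 2_500))  # 140
```

Under those assumptions a full pass over the list would take about 140 days on the free tier, which fits the guess that only the first day's worth of words was actually checked.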
@Orivoir did get |
|
Maybe it kills too many words. For example: white lives matter, too [:joke:] |
|
Hi all, I have run `words_alpha.txt` through the "nltk" Python library. The result has 210693 words in total. This seems to be a bit better, but I have noticed there are still a few oddities in there (maybe things like common abbreviations, which aren't actual words, remain). But overall I think this has cleaned out any non-English words. |
|
@SDidge appreciate the share! |
|
@SDidge At first glance I can't seem to find any non-English words in the file, so I'd say this one is the cleanest file so far, nice work! |
@SDidge , what exactly did you use from the NLTK library to check the list of words? |
|
@Timokasse , I just checked if the word existed in the "words" corpus. E.g. `from nltk.corpus import words`, then `[word for word in words_alpha if word in words.words()]`. Something like this.
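For anyone wanting to reproduce this, a minimal sketch of that membership check might look like the following. It is not the exact script used above: the function names and the lowercase comparison are assumptions made for illustration, and the corpus is loaded into a set once because testing membership against the list returned by `words.words()` for every word would be very slow.

```python
from typing import Iterable, List, Set


def load_nltk_vocabulary() -> Set[str]:
    """Load the NLTK "words" corpus as a lowercase set (requires nltk)."""
    import nltk
    from nltk.corpus import words

    nltk.download("words", quiet=True)  # fetches the corpus on first use
    return {w.lower() for w in words.words()}


def filter_words(candidates: Iterable[str], vocabulary: Set[str]) -> List[str]:
    """Keep only the candidates found in the vocabulary, preserving order."""
    return [w for w in candidates if w.lower() in vocabulary]


def clean_file(src: str, dst: str) -> int:
    """Filter src (one word per line) into dst; return the number of words kept."""
    vocabulary = load_nltk_vocabulary()
    with open(src, encoding="utf-8") as f:
        kept = filter_words((line.strip() for line in f if line.strip()), vocabulary)
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")
    return len(kept)
```

Calling `clean_file("words_alpha.txt", "words_alpha_nltk.txt")` (the output name is hypothetical) would write the filtered list and return how many words survived. The count may differ from the 210693 reported above depending on the corpus version and on case handling.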