Description
Preface
Cadmium has a stemmer which is used downstream in several other modules. Its usefulness is not in question.
However, relying only on a stemmer limits Cadmium in several ways:
- Stemming a word strips out its grammatical information, which makes POS tagging impossible.
- Stemming algorithms are implemented for only a handful of languages. Some languages are, by their nature, very difficult to stem.
In practice, lemmatization essentially means mapping tokens to lemmas through a lookup table (or dictionary) and applying additional rules depending on the token found.
Multilingual lemma lookup tables are freely available under MIT-compatible licenses.
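To make the idea concrete, here is a minimal sketch of a lookup-based lemmatizer, written in Python for brevity rather than Crystal. The `Lemmatizer` class, the tiny inline table and the suffix rules are all hypothetical illustrations, not an existing Cadmium API or real lemma data.

```python
from typing import Dict, List, Tuple


class Lemmatizer:
    """Hypothetical lookup-based lemmatizer: table first, suffix rules as fallback."""

    def __init__(self, table: Dict[str, str], rules: List[Tuple[str, str]]):
        self.table = table  # inflected form -> lemma
        self.rules = rules  # (suffix, replacement) pairs, tried in order

    def lemmatize(self, token: str) -> str:
        word = token.lower()
        # 1. Exact lookup covers irregular forms ("mice" -> "mouse").
        if word in self.table:
            return self.table[word]
        # 2. Otherwise apply the first suffix rule that matches.
        for suffix, replacement in self.rules:
            if word.endswith(suffix) and len(word) > len(suffix):
                return word[: -len(suffix)] + replacement
        # 3. Unknown token: return it unchanged (lowercased).
        return word


# Toy data for illustration only; a real table holds many thousands of entries.
table = {"mice": "mouse", "was": "be", "better": "good"}
rules = [("ies", "y"), ("s", "")]

lemmatizer = Lemmatizer(table, rules)
print([lemmatizer.lemmatize(t) for t in ["Mice", "was", "ponies", "cats", "sheep"]])
```

Checking the table before the rules matters: "was" ends in "s", so a rules-first design would wrongly reduce it to "wa".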
Details
- Create a new lemmatizer repository in cadmiumcr
- In it, create a `Cadmium::Lemmatizer` module inspired in its form by Cadmium::Util::StopWords
- Its data folder will contain only the English JSON file of lemmas (several MB in size)
- The name of the shard will be `cadmium_lemmatizer`
The real difficulty is, IMO, how to deal with data for other languages.
Here are several realistic possibilities:
- Create a cadmium_i18n_data shard containing the data for all languages (might weigh tens of MB of JSON)
- Create a cadmium_XX_data shard for each language, containing its own data. (Can we group repos into a folder on GitHub?)
- Host the data somewhere and ask developers to download what they need.
- Don't provide the data at all and point developers to possible sources.
These are the solutions I could come up with, but if you have other ideas, do tell!
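Whichever option is chosen, the lemmatizer would probably need to locate per-language data at runtime and fail clearly when a language is missing. The following is a rough sketch of that idea in Python (not Crystal), assuming one `<lang>.json` file per language in a data directory. The `LemmaData` class and the file layout are hypothetical and not part of any existing shard.

```python
import json
from pathlib import Path
from typing import Dict


class LemmaData:
    """Hypothetical loader: one JSON file per language, read only when requested."""

    def __init__(self, data_dir: Path):
        self.data_dir = Path(data_dir)
        self._cache: Dict[str, Dict[str, str]] = {}

    def table_for(self, lang: str) -> Dict[str, str]:
        # Load each language at most once; multi-megabyte files are costly to parse.
        if lang not in self._cache:
            path = self.data_dir / f"{lang}.json"
            if not path.exists():
                # Tell the developer exactly which file is expected.
                raise FileNotFoundError(
                    f"No lemma data for '{lang}'; expected {path}"
                )
            with path.open(encoding="utf-8") as f:
                self._cache[lang] = json.load(f)
        return self._cache[lang]
```

With this shape, it does not matter whether the files come from a single i18n shard, a per-language shard, or a manual download: only the data directory changes.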
References
spaCy has a good implementation of lemmatizers.
You can check their GitHub repository to get an idea of what the data looks like: the Spanish language data, for example.