Proposal: Cadmium::Lemmatizer

## Preface

Cadmium has a stemmer that several other modules rely on downstream, and its usefulness is not in question.

However, relying only on a stemmer limits Cadmium in several ways:

- Stemming a word strips its grammatical information, which makes POS tagging on the stemmed output impossible.
- Stemming algorithms are implemented for only a handful of languages, and some languages are by nature very difficult to stem.

In practice, lemmatization amounts to looking each token up in a table (or dictionary) that maps word forms to lemmas, and applying additional rules depending on the token found.
Lemma lookup tables for many languages are freely available under MIT-compatible licenses.
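
To make the lookup-plus-rules idea concrete, here is a minimal sketch. It is written in Python purely for illustration (a Crystal implementation would be shaped similarly), and the table entries, the suffix rules, and the `lemmatize` name are all hypothetical, not part of Cadmium.

```python
# Hypothetical toy table; real data would come from a per-language lemma file.
LEMMA_TABLE = {
    "was": "be",
    "mice": "mouse",
    "better": "good",
}

# Hypothetical fallback rules as (suffix, replacement), tried in order
# when the token is not found in the table.
SUFFIX_RULES = [
    ("ies", "y"),
    ("ing", ""),
    ("s", ""),
]


def lemmatize(token: str) -> str:
    """Return the lemma for a token: table lookup first, then suffix rules."""
    word = token.lower()
    if word in LEMMA_TABLE:
        return LEMMA_TABLE[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least 3 characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word
```

Irregular forms such as "mice" are resolved by the table, while regular forms such as "ponies" fall through to the rules. The rule set here is deliberately naive and only shows where such rules would plug in.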

## Details

- Create a new lemmatizer repository in cadmiumcr
- Create in it a `Cadmium::Lemmatizer` module modeled on `Cadmium::Util::StopWords`
- Its data folder will contain only the English JSON file of lemmas (several MB in size)
- The name of the shard will be `cadmium_lemmatizer`
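
As a rough sketch of how a data folder with one JSON file per language could be consumed, consider the following. It is in Python for illustration only, and it assumes a flat JSON object mapping word forms to lemmas and a `<language>.json` file naming scheme; neither is specified by this proposal, and the `Lemmatizer` class shown is hypothetical.

```python
import json
from pathlib import Path


class Lemmatizer:
    """Hypothetical loader: reads <data_dir>/<language>.json into a lookup table."""

    def __init__(self, data_dir: str, language: str = "en") -> None:
        # Assumed layout: one flat {"form": "lemma"} JSON object per language.
        path = Path(data_dir) / f"{language}.json"
        with path.open(encoding="utf-8") as handle:
            self.table: dict[str, str] = json.load(handle)

    def lemmatize(self, token: str) -> str:
        # Unknown tokens are returned unchanged.
        return self.table.get(token.lower(), token)
```

With this shape, shipping only English by default and adding other languages later comes down to where the extra JSON files live, which is the open question discussed next.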

The real difficulty is, IMO, how to deal with data for other languages.

Here are several realistic possibilities:

- Create a cadmium_i18n_data shard containing the data for all languages (might weigh tens of MB of JSON)
- Create a cadmium_XX_data shard for each language, containing only that language's data. (Can we group repos in a folder on GitHub?)
- Host the data somewhere and ask developers to download what they need.
- Don't provide the data at all and point developers to possible sources.

Those are the solutions I could come up with, but if you have other ideas, do tell!



## References

[spaCy](https://spacy.io/usage/linguistic-features) has a good implementation of lemmatizers.

You can check their GitHub repository to get an idea of what the data looks like: [Spanish, for example](https://github.com/explosion/spaCy/tree/master/spacy/lang/es)

