Proposal: Cadmium::Lemmatizer #31

Open
@rmarronnier

Preface

Cadmium has a stemmer that several other modules rely on downstream. Its usefulness is not in question.

However, relying only on a stemmer limits Cadmium in two ways:

  • Stemming a word strips its grammatical information, which makes POS tagging on the stemmed output impossible.
  • Stemming algorithms are implemented for only a handful of languages, and some languages are by nature very difficult to stem.

In practice, lemmatization essentially means mapping tokens to lemmas through a lookup table (or dictionary) and applying additional rules depending on the token found.
Lemma lookup tables for many languages are freely available under MIT-compatible licenses.
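
To make the "lookup table plus rules" idea concrete, here is a minimal sketch in Python (the language spaCy uses), not a proposal for the actual Crystal API. The table contents, the suffix rules, and the `lemmatize` name are all made up for illustration; a real table would be loaded from a JSON file.

```python
import json

# Hypothetical lookup table. A real one would be read from a JSON file
# of lemmas rather than from an inline string.
LOOKUP = json.loads('{"mice": "mouse", "went": "go", "better": "good"}')

# Hypothetical fallback rules (suffix, replacement), tried in order
# when the token is not in the lookup table.
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("s", "")]


def lemmatize(token: str) -> str:
    """Return the lemma from the table, else apply the first matching rule."""
    word = token.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least two characters so short words
        # such as "is" are left untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Irregular forms such as "mice" are resolved by the table, while regular plurals such as "ponies" fall through to the rules. A real rule set would also depend on the POS tag of the token, which this sketch leaves out.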

Details

  • Create a new lemmatizer repository in cadmiumcr
  • Create in it a Cadmium::Lemmatizer module modeled on Cadmium::Util::StopWords
  • Its data folder will only contain the English JSON file of lemmas (several MB in size)
  • The name of the shard will be cadmium_lemmatizer

The real difficulty is, IMO, how to deal with data for other languages.

Here are several realistic possibilities:

  • Create a cadmium_i18n_data shard containing the data for all languages (might weigh tens of MB of JSON)
  • Create a cadmium_XX_data shard for each language containing its own data. (Can we group repositories in a folder on GitHub?)
  • Host the data somewhere and ask developers to download what they need.
  • Don't provide the data at all and point developers to possible sources.

Those are the solutions I could come up with, but if you have other ideas, do tell!
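
Whichever option is picked, the lemmatizer ends up resolving one data file per language at runtime. Below is a rough Python sketch of that step, assuming one `XX.json` file per language code inside a data directory. The `load_lemmas` function, the `LemmaDataMissing` error, and the file layout are hypothetical and only meant to illustrate the idea.

```python
import json
from pathlib import Path


class LemmaDataMissing(Exception):
    """Raised when no lemma file exists for the requested language."""


def load_lemmas(language: str, data_dir) -> dict:
    """Load the lemma table for `language` from `<data_dir>/<language>.json`."""
    path = Path(data_dir) / f"{language}.json"
    if not path.is_file():
        # The data may live in another shard or need to be downloaded,
        # so fail with a message that says which file was expected.
        raise LemmaDataMissing(f"no lemma data for {language!r}: expected {path}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

With this shape, the bundled English file and any separately installed or downloaded language files are loaded the same way, and a missing language fails with an explicit error instead of silently returning tokens unchanged.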

References

spaCy has a good implementation of lemmatizers.

You can check their GitHub repository to get an idea of what the data looks like, for example for the Spanish language.
