Proposal: Cadmium::POSTagger #32

## Preface

As discussed in cadmiumcr/cadmium#31, `Cadmium::Lemmatizer` needs a `Token` object carrying POS and morphology data to work properly and be fully tested.
The aim of this proposal is to implement a `Cadmium::POSTagger` that creates such a `Token` object for each input string.
The first tagging algorithm I plan to implement is the [Viterbi algorithm](https://www.freecodecamp.org/news/a-deep-dive-into-part-of-speech-tagging-using-viterbi-algorithm-17c8de32e8bc/).
If I can generalize it enough, the plan is to move it to `Cadmium::Classifier` so it can be used for other tasks.
Later, I also plan to implement [dynamic feature induction](https://www.aclweb.org/anthology/N16-1031), which could also be used for Named Entity Recognition (and so could be moved to `Classifier` as well).
As with the `Tokenizer` and `Summarizer` modules, the goal is to make it possible to choose a specific algorithm instead of being tied to a single one.
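
To make the first step concrete, here is a minimal sketch of Viterbi decoding for a first-order hidden Markov model. It is written in Python purely for illustration; a Crystal version for Cadmium would differ in detail. The function name `viterbi`, the `unknown_p` fallback for unseen words, and the toy probability tables are all assumptions made for this example, not part of any existing Cadmium API or trained model.

```python
from typing import Dict, List

def viterbi(
    words: List[str],
    tags: List[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    unknown_p: float = 1e-6,  # assumed fallback probability for unseen words
) -> List[str]:
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []
    # best[i][t]: probability of the best tag path ending in tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(words[0], unknown_p) for t in tags}]
    # back[i][t]: the previous tag on that best path
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            best[i][t] = (
                best[i - 1][prev] * trans_p[prev][t] * emit_p[t].get(words[i], unknown_p)
            )
            back[i][t] = prev
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy model with made-up probabilities (illustration only, not trained data).
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.6, "VERB": 0.4}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"dogs": 0.5, "cats": 0.4, "bark": 0.1},
    "VERB": {"bark": 0.6, "chase": 0.4},
}

print(viterbi(["dogs", "chase", "cats"], TAGS, START_P, TRANS_P, EMIT_P))
```

Nothing in the function is specific to POS tags: it only needs a set of states plus start, transition, and emission tables, which is what would make a move to `Cadmium::Classifier` plausible. A real implementation would likely work with log-probabilities to avoid underflow on long sentences.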

## Details

I propose the following steps:
- Create a `cadmiumcr/pos_tagger` repository
- Implement a proof-of-concept (POC) POS tagger with the Viterbi algorithm
- If the algorithm can be generalized (i.e. it is not too specific to POS tagging), move it to `Cadmium::Classifier::Viterbi`
- Move the working POS tagger to its repository along with English tagging data
- Push data for other languages to the `cadmiumcr/languages` repository (I'm not sure about this one yet, since I don't know the sizes of the models)
- Move the `Token` struct to `Cadmium::Utils`, as it will be used by at least the POS tagger and the Lemmatizer
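
As a rough illustration of how the shared `Token` and a pluggable tagger could fit together, here is a small Python sketch. The field names (`text`, `pos`, `morphology`), the `POSTagger` constructor, and the `noun_baseline` algorithm are hypothetical and only show the shape of the idea; the actual Crystal struct in `Cadmium::Utils` may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Token:
    """Hypothetical token shared by the POS tagger and the lemmatizer."""
    text: str
    pos: str = ""
    morphology: Dict[str, str] = field(default_factory=dict)

# A tagging algorithm maps a list of words to a list of tags of equal length.
Algorithm = Callable[[List[str]], List[str]]

class POSTagger:
    """Wraps any tagging algorithm so callers can choose which one to use."""

    def __init__(self, algorithm: Algorithm) -> None:
        self.algorithm = algorithm

    def tag(self, words: List[str]) -> List[Token]:
        tags = self.algorithm(words)
        return [Token(text=w, pos=t) for w, t in zip(words, tags)]

def noun_baseline(words: List[str]) -> List[str]:
    # Trivial placeholder algorithm: label every word as a noun.
    return ["NOUN"] * len(words)

tokens = POSTagger(noun_baseline).tag(["dogs", "bark"])
print(tokens)
```

Swapping `noun_baseline` for a Viterbi-based function (or, later, another algorithm) would not change the calling code, which is the kind of flexibility described in the preface.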

## References

[List of existing POS Taggers](https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art))


