Description
Preface
As discussed in #31, Cadmium::Lemmatizer needs a Token object with POS and morphology data to work properly and be fully tested.
The aim of this proposal is to implement a Cadmium::POSTagger that will create such a Token object for each input string.
The first tagging algorithm I'm planning to implement is the Viterbi algorithm.
If I can generalize it enough, the plan is to move it to Cadmium::Classifier so it can be used for other objectives.
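
To make the idea concrete, here is a minimal sketch of Viterbi decoding over a first-order hidden Markov model. It is written in Python purely as a language-neutral illustration; the real implementation would be in Crystal. The function name, parameter names, and the tiny probability tables are all assumptions made up for the example, not trained data or an existing Cadmium API.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most likely tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def log(p):
        # Floor zero probabilities so unseen events do not produce log(0).
        return math.log(max(p, floor))

    # best[i][tag] = (log-probability of the best path ending in `tag`, previous tag)
    best = [{t: (log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + log(trans_p[p].get(t, 0)))
            score = (best[-1][prev][0] + log(trans_p[prev].get(t, 0))
                     + log(emit_p[t].get(word, 0)))
            column[t] = (score, prev)
        best.append(column)

    # Backtrack from the best final tag to recover the full path.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Toy model with invented probabilities, for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

print(viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P))
# → ['DET', 'NOUN', 'VERB']
```

Nothing in the function is specific to part-of-speech tags: it only sees states, observations, and three probability tables, which is why moving it to a shared classifier namespace seems plausible.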
Later, I'm also planning to implement dynamic feature induction, which could also be used for Named Entity Recognition (and so could be moved to Classifier as well).
As with the Tokenizer module or the Summarizer, the plan is to make it possible to choose a specific algorithm instead of being tied to a single one.
Details
I propose to implement this with the following actions:
- Create a cadmiumcr/pos_tagger repository
- Implement a POC POS tagger with the Viterbi algorithm
- If the algorithm can be generalized (i.e. not too specific to POS tagging), move it to Cadmium::Classifier::Viterbi
- Move the working POS Tagger to its repository along with English tagging data
- Push data for other languages to the cadmiumcr/languages repository (I'm not sure yet about this one, without knowing the sizes of the models)
- Move the Token struct to Cadmium::Utils as it will be used at least by both the POS Tagger and the Lemmatizer
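
As a rough picture of what the shared Token could carry, here is a small sketch, again in Python as a language-neutral illustration rather than the Crystal struct itself. The field names (`verbatim`, `pos`, `morphology`) and the `to_tokens` helper are hypothetical; the proposal only says that the token holds POS and morphology data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Token:
    # Field names are assumptions for illustration only.
    verbatim: str                       # the input string as it appeared
    pos: Optional[str] = None           # filled in by the POS tagger
    morphology: Dict[str, str] = field(default_factory=dict)  # e.g. {"Number": "Sing"}

def to_tokens(words: List[str], tags: List[str]) -> List[Token]:
    """Pair each input string with its predicted tag (hypothetical helper)."""
    return [Token(verbatim=w, pos=t) for w, t in zip(words, tags)]

tokens = to_tokens(["the", "dog"], ["DET", "NOUN"])
print(tokens[1].pos)
# → NOUN
```

With a shape like this, the tagger would produce the tokens and the lemmatizer would only read them, so neither module would need to depend on the other.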