Improvements to support naive Bayes classification of text documents

Currently we wrap NaiveBayes.jl, but only allow tabular input (internally converted to a matrix), which limits its application in NLP and elsewhere. However, NaiveBayes.jl itself supports dictionary input.

There is also a text-specific NaiveBayes classifier in TextAnalysis, which accepts dictionaries (keyed on abstract strings). 

I have never used either package seriously, but my gut feeling is that there would be little difference in performance when using dictionaries, so I would suggest we simply enhance the existing interface at [MLJNaiveBayesInterface.jl](https://github.com/JuliaAI/MLJNaiveBayesInterface.jl) rather than write a new interface. Also, I see no reason to restrict to abstract string keys: any abstract dictionary with `Integer` values should be supportable. Such objects have the scientific type `Multiset{S}`, where `S` is the scitype of the keys. So we could support any `Multiset` as input.
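To make the dictionary idea concrete, here is a minimal sketch of the kind of input I have in mind: one word-count dictionary per document. The helper `bag_of_words` is hypothetical (it is not part of NaiveBayes.jl, TextAnalysis or MLJ) and the whitespace tokenisation is only an assumption for illustration.

```julia
# Hypothetical helper, for illustration only: count whitespace-separated,
# lower-cased tokens in a document, giving a "bag of words".
function bag_of_words(doc::AbstractString)
    counts = Dict{String,Int}()
    for token in split(lowercase(doc))
        word = String(token)
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

docs = ["the cat sat on the mat", "the dog barked"]

# One dictionary per document; a vector like this is what a
# dictionary-based interface would presumably receive as input.
bags = [bag_of_words(d) for d in docs]
```

Each element of `bags` is an `AbstractDict` with `Integer` values, which is the shape of object discussed above.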

Another possibility might be to add support for sparse matrices, probably adjoints of Julia's `SparseMatrixCSC` matrices (input to an MLJ model is n x p by convention). However, this requires a small generalisation of the NaiveBayes package (needed anyway), which at the moment only supports concrete `Matrix` types.
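As a rough sketch of the sparse option, the following builds a p x n `SparseMatrixCSC` of counts (one column per document) and takes its adjoint to get the n x p shape MLJ expects. The dictionaries and the names `bags`, `vocab` and `index` are made up for illustration; nothing here is existing NaiveBayes.jl or MLJ API.

```julia
using SparseArrays

# Hypothetical word-count dictionaries, one per document.
bags = [Dict("the" => 2, "cat" => 1), Dict("the" => 1, "dog" => 1)]

# Sorted vocabulary and a lookup from word to feature index.
vocab = sort!(unique!([w for bag in bags for w in keys(bag)]))
index = Dict(w => j for (j, w) in enumerate(vocab))

# Collect (feature, document, count) triplets.
rows, cols, vals = Int[], Int[], Int[]
for (i, bag) in enumerate(bags), (w, c) in bag
    push!(rows, index[w])  # feature (word) index
    push!(cols, i)         # document index
    push!(vals, c)         # count of the word in the document
end

# p x n matrix: each document is a column, which CSC stores contiguously.
Xt = sparse(rows, cols, vals, length(vocab), length(bags))

# Lazy adjoint gives the n x p orientation without copying.
X = Xt'
```

Storing documents as columns keeps per-document access cheap in CSC format, which is why the adjoint, rather than a plain n x p `SparseMatrixCSC`, seems the natural candidate.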

Very happy to hear other suggestions for improving naive Bayes support.

@storopoli Perhaps this is something you would be interested in helping out with?
 
cc @pazzo83

Improvements to support naive Bayes classification of text documents #401
