Fast, efficient, open source & open data Word Sense Disambiguation models for practical use.
The easiest way to install wsd is to use pip:
pip install wsd[all]
You will also need the JMDict Project dictionary. You can use the following helper to download the file:
python -m wsd download jmdict
Currently, only the JMDict model is available.
The model has not been trained yet and currently returns all matching entries found in the JMDict Project dictionary.
The JMDict model can be imported from the wsd.models module:
from wsd.models import JMDict
jmdict = JMDict()
From there, you can use it to search all relevant entries in the dictionary:
for entry in jmdict.search("かんじ"):
    print(entry)
# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...
Alternatively, you can use the predict method to get the unique ent_seq of the best entry:
jmdict.predict("かんじ")
# Output:
# '1210280'
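To make the difference between search and predict concrete, here is a minimal, self-contained sketch that does not use wsd at all. ToyEntry and ToyDictionary are hypothetical stand-ins, the identifier "0000000" is made up, and picking the first match as the "best" entry is an assumption about how an untrained model might behave, not a description of the library's internals. The two ent_seq values are reused from the example output above purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ToyEntry:
    """Hypothetical stand-in for a dictionary entry; not the wsd Entry class."""
    ent_seq: str
    reading: str


class ToyDictionary:
    """Hypothetical in-memory dictionary indexed by reading."""

    def __init__(self, entries: List[ToyEntry]) -> None:
        self._by_reading: Dict[str, List[ToyEntry]] = {}
        for entry in entries:
            self._by_reading.setdefault(entry.reading, []).append(entry)

    def search(self, text: str) -> List[ToyEntry]:
        # Return every entry whose reading matches, in insertion order.
        return list(self._by_reading.get(text, []))

    def predict(self, text: str) -> Optional[str]:
        # Assumption: without a trained ranker, "best" is simply the first match.
        matches = self.search(text)
        return matches[0].ent_seq if matches else None


toy = ToyDictionary([
    ToyEntry(ent_seq="1210280", reading="かんじ"),
    ToyEntry(ent_seq="1211690", reading="かんじ"),
    ToyEntry(ent_seq="0000000", reading="ことば"),  # made-up identifier
])

print([entry.ent_seq for entry in toy.search("かんじ")])
print(toy.predict("かんじ"))
```

In this sketch search returns a list of candidates while predict collapses that list to a single identifier, which is the step a trained model would be expected to improve.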
To contribute more data:
The interface has been designed to work on mobile phones, making it easy to contribute whenever you have five minutes available.
First, you will need some data to train on. Currently, only the Japanese dataset is available. You can download it with:
python -m wsd download dataset
Once the data is available locally, you can read it and train a model:
import os
from sklearn.linear_model import LogisticRegression
from wsd.models import JMDictWithPointWiseRanking
from wsd.utils import load_dataset
basedir = os.getenv("PJ_DIR")  # PJ_DIR must point to the directory that contains data/dataset.xml.
X, y = load_dataset(f"{basedir}/data/dataset.xml")
X = ["".join(x) for x in X] # The baseline model already has a tokenizer.
model = LogisticRegression(C=10, penalty="l1", solver="liblinear")
jmdict = JMDictWithPointWiseRanking(ranking_model=model)
jmdict.fit(X, y)
jmdict.search("感じ")
# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...
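The class name JMDictWithPointWiseRanking suggests a pointwise ranking approach, in which each candidate is scored independently of the others and the candidates are then sorted by score. The following sketch illustrates that general idea in plain Python. It is not the library's implementation: rank_pointwise, overlap_score, and the candidate tuples are hypothetical, and the character-overlap scorer is a toy placeholder for whatever a fitted ranking model would compute (for a scikit-learn classifier, that could plausibly be a probability from predict_proba).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical candidate: (identifier, description text).
Candidate = Tuple[str, str]


def overlap_score(context: str, candidate: Candidate) -> float:
    """Toy scorer: fraction of the description's distinct characters found in the context."""
    _, description = candidate
    chars = set(description)
    if not chars:
        return 0.0
    return len(chars & set(context)) / len(chars)


def rank_pointwise(
    context: str,
    candidates: Sequence[Candidate],
    score: Callable[[str, Candidate], float],
) -> List[Candidate]:
    """Score each candidate on its own, then sort from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(context, c), reverse=True)


candidates = [("a", "xyz"), ("b", "abc")]
ranked = rank_pointwise("aabbc", candidates, overlap_score)
print(ranked)
```

Because each score depends only on the context and a single candidate, the scorer can be swapped for a trained binary classifier without changing the ranking function.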
See Using Docker