linalgo/wsd

License: MIT

Practical Word Sense Disambiguation

Fast, efficient, open source & open data Word Sense Disambiguation models for practical use.

Installation

The easiest way to install wsd is to use pip:

pip install wsd[all]

You will also need the JMDict Project dictionary. You can use the following helper to download the file:

python -m wsd download jmdict

Getting Started

Currently, only the JMDict model is available. It has not been trained yet, so for now it returns all matching entries found in the JMDict Project dictionary.

The JMDict model can be imported from the wsd.models module:

from wsd.models import JMDict

jmdict = JMDict()

From there, you can use it to search all relevant entries in the dictionary:

for entry in jmdict.search("かんじ"):
    print(entry)
# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...
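
As a rough illustration of what "returns all matching entries" means for an untrained model, the sketch below runs the same kind of lookup against a tiny in-memory dictionary. ToyEntry, TOY_DICTIONARY and toy_search are hypothetical stand-ins and are not part of wsd. The first two ent_seq values are copied from the output above; the third entry is invented to show a reading that does not match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyEntry:
    """Hypothetical stand-in for a dictionary entry (not the wsd Entry class)."""
    ent_seq: str
    reading: str


# Placeholder data: the first two ent_seq values come from the output above;
# the third entry is invented to show a non-matching reading.
TOY_DICTIONARY = [
    ToyEntry(ent_seq="1210280", reading="かんじ"),
    ToyEntry(ent_seq="1211690", reading="かんじ"),
    ToyEntry(ent_seq="0000001", reading="ことば"),
]


def toy_search(query: str) -> list[ToyEntry]:
    """Return every entry whose reading equals the query, in dictionary order."""
    return [entry for entry in TOY_DICTIONARY if entry.reading == query]


for entry in toy_search("かんじ"):
    print(entry)
```

No ranking happens here: every entry with a matching reading is returned, which is why a trained model is needed to pick the right sense.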

Alternatively, you can use the predict method to get the unique ent_seq of the best entry:

jmdict.predict("かんじ")
# Output:
# '1210280'
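
One plausible reading of the two outputs above is that predict returns the ent_seq of the first entry that search yields. The sketch below shows that relationship with a hypothetical toy_predict helper. This is an assumption for illustration, not a description of how wsd implements predict.

```python
from typing import Optional, Sequence


def toy_predict(ranked_ent_seqs: Sequence[str]) -> Optional[str]:
    """Return the first (best-ranked) ent_seq, or None if nothing matched."""
    return ranked_ent_seqs[0] if ranked_ent_seqs else None


# ent_seq values copied from the search output above, in the order shown.
print(toy_predict(["1210280", "1211690"]))  # prints 1210280
print(toy_predict([]))  # prints None
```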

Adding more data

To contribute more data:

The interface has been designed to work on mobile phones, making it easy to contribute whenever you have five minutes to spare.

Training a model

First, you will need some data to train on. Currently, only the Japanese dataset is available. You can download it with:

python -m wsd download dataset

Once the data is available locally, you can read it and train a model:

import os

from sklearn.linear_model import LogisticRegression

from wsd.models import JMDictWithPointWiseRanking
from wsd.utils import load_dataset

basedir = os.getenv("PJ_DIR")  # Project root directory; PJ_DIR must be set in the environment.
X, y = load_dataset(f"{basedir}/data/dataset.xml")
X = ["".join(x) for x in X]  # The baseline model already has a tokenizer.

model = LogisticRegression(C=10, penalty="l1", solver="liblinear")
jmdict = JMDictWithPointWiseRanking(ranking_model=model)
jmdict.fit(X, y)
for entry in jmdict.search("感じ"):
    print(entry)

# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...
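
For readers unfamiliar with the term, pointwise ranking generally means scoring each candidate independently and then sorting candidates by score. The sketch below shows that idea in plain Python with a logistic scoring function. The feature vectors, weights and candidate names are made up for illustration, and the code does not reflect the internals of JMDictWithPointWiseRanking.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw score to a probability-like value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def pointwise_rank(candidates, weights, bias=0.0):
    """Score each candidate independently, then sort by descending score.

    candidates: dict mapping a candidate id to its feature vector.
    weights: one weight per feature (a stand-in for a trained linear model).
    """
    scored = []
    for cand_id, features in candidates.items():
        z = bias + sum(w * x for w, x in zip(weights, features))
        scored.append((cand_id, sigmoid(z)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up features and weights, for illustration only.
toy_candidates = {
    "sense_a": [1.0, 0.0],
    "sense_b": [0.0, 1.0],
    "sense_c": [1.0, 1.0],
}
toy_weights = [2.0, -1.0]

for cand_id, score in pointwise_rank(toy_candidates, toy_weights):
    print(cand_id, round(score, 3))
```

Because each candidate is scored on its own, any binary classifier that outputs a probability can serve as the scoring model, which fits the way a LogisticRegression instance is passed in as ranking_model above.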

Build using Docker

See Using Docker

Attribution and LICENSE
