Practical Word Sense Disambiguation

Fast, efficient, open source and open data Word Sense Disambiguation models for practical use.

Installation

The easiest way to install wsd is to use pip:

pip install "wsd[all]"

You will also need the JMDict Project dictionary. You can use the following helper to download the file:

python -m wsd download jmdict

Getting Started

Currently, only the JMDict model is available. The model has not been trained yet, so for now it returns all matching entries found in the JMDict Project dictionary.

The JMDict model can be imported from the wsd.models module:

from wsd.models import JMDict

jmdict = JMDict()

From there, you can use it to search for all relevant entries in the dictionary:

for entry in jmdict.search("かんじ"):
    print(entry)
# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...

Alternatively, you can use the predict method to get the unique ent_seq of the best entry:

jmdict.predict("かんじ")
# Output:
# '1210280'
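To make the relationship between search and predict concrete, here is a minimal, self-contained sketch. It does not use the wsd package: ToyDictionary, the reading field, and the placeholder entry 9999999 are hypothetical, and the rule "predict returns the ent_seq of the first search result" is an assumption made for illustration, not a description of how JMDict is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Stand-in entry. Only ent_seq appears in the output shown above."""
    ent_seq: str
    reading: str  # hypothetical field, used here only for matching


class ToyDictionary:
    """Hypothetical stand-in for a dictionary model, not the wsd implementation."""

    def __init__(self, entries):
        self._entries = list(entries)

    def search(self, text):
        # Return every entry whose reading matches, in stored order.
        return [e for e in self._entries if e.reading == text]

    def predict(self, text):
        # Assumption: the "best" entry is simply the first search result.
        matches = self.search(text)
        return matches[0].ent_seq if matches else None


toy = ToyDictionary([
    Entry("1210280", "かんじ"),
    Entry("1211690", "かんじ"),
    Entry("9999999", "ほか"),  # made-up placeholder entry
])

print([e.ent_seq for e in toy.search("かんじ")])
print(toy.predict("かんじ"))
```

Under this assumption, search is useful when you want to inspect every candidate, while predict is a shortcut for callers that only need a single identifier.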

Adding more data

To contribute more data:

1. Create an account on Linhub
2. Start reviewing entries using this interface

The interface has been designed to work on mobile phones, so you can contribute whenever you have five minutes to spare.

Training a model

First, you will need some data to train on. Currently, only the Japanese dataset is available. You can download it with:

python -m wsd download dataset

Once the data is available locally, you can read it and train a model:

import os

from sklearn.linear_model import LogisticRegression

from wsd.models import JMDictWithPointWiseRanking
from wsd.utils import load_dataset

basedir = os.environ["PJ_DIR"]  # Raises KeyError if PJ_DIR is not set.
X, y = load_dataset(f"{basedir}/data/dataset.xml")
X = ["".join(x) for x in X]  # The baseline model already has a tokenizer.

model = LogisticRegression(C=10, penalty="l1", solver="liblinear")
jmdict = JMDictWithPointWiseRanking(ranking_model=model)
jmdict.fit(X, y)
for entry in jmdict.search("感じ"):
    print(entry)

# Output:
# Entry(ent_seq='1210280', ...
# Entry(ent_seq='1211690', ...
# ...
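The name JMDictWithPointWiseRanking suggests a pointwise ranking approach: each (context, candidate) pair is scored independently by a binary classifier, and candidates are then sorted by score. The sketch below illustrates that general idea only. It is not the library's implementation: it swaps scikit-learn's LogisticRegression for a small hand-rolled logistic regression so that it runs without dependencies, and the feature scheme, function names, and English toy data are all made up for illustration.

```python
import math


def pair_features(context_tokens, candidate):
    """Cross every context token with the candidate sense id."""
    return [f"{tok}|{candidate}" for tok in context_tokens]


def score(weights, features):
    """Logistic score in (0, 1) for one (context, candidate) pair."""
    z = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


def fit(examples, candidates, epochs=50, lr=0.5):
    """Train on independent (context, candidate) pairs labelled 1 or 0."""
    weights = {}
    for _ in range(epochs):
        for context_tokens, gold in examples:
            for cand in candidates:
                feats = pair_features(context_tokens, cand)
                label = 1.0 if cand == gold else 0.0
                error = label - score(weights, feats)
                # Stochastic gradient step on the log-loss.
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + lr * error
    return weights


def rank(weights, context_tokens, candidates):
    """Sort candidates by their individually computed scores, best first."""
    return sorted(
        candidates,
        key=lambda c: score(weights, pair_features(context_tokens, c)),
        reverse=True,
    )


# Made-up toy data: two senses of "bank" and the words around them.
CANDIDATES = ["bank_river", "bank_money"]
EXAMPLES = [
    (["river", "water"], "bank_river"),
    (["loan", "money"], "bank_money"),
    (["muddy", "river"], "bank_river"),
    (["account", "money"], "bank_money"),
]

weights = fit(EXAMPLES, CANDIDATES)
ranking = rank(weights, ["money", "deposit"], CANDIDATES)
print(ranking)
```

Because every pair is scored on its own, any binary classifier that outputs a probability or decision score can be plugged in as the ranking model, which is presumably why the training code above passes a scikit-learn estimator as ranking_model.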

Build using Docker

See Using Docker
