Cleanup script for Wiktextract data for use with Glossa. Currently only tested for Spanish.
Requires Wiktextract dictionary in JSON Lines format, which can be downloaded from kaikki.org.
Should be invoked with:
python cleanup.py --input input.jsonl --output output.jsonl
Output is also in JSON Lines format, each line a list with two items, the first a word and the second a list of entries with that word. The output is intended to be read in line-by-line and used to construct a dictionary with each word as a key.
Output file for full Spanish dictionary is 196MiB as of 2025-02-10.
The script can optionally generate a list of lemmas present in the processed wiktextract dictionary. Note this ignores any extra lemmas that might be included in the lemmatization table below. This feature can be be invoked with --lemma-list="lemma_list_output"
The script can also prepare a lemmatization table using data from the Corpus Del Español and the Spanish Resource Grammar.
Should be invoked with:
python cleanup.py --input input.jsonl --output output.jsonl --lemma-list="lemmas" --lemmatization-table="table-output.jsonl" --cde-input="cde_forms.txt" --srg-input-dir="srg/freeling/es/MM/"
Wiktextract provides high quality but flawed computational dictionaries based on Wiktionary data. I was inspired by Ebook dictionary creator but needed a number of different features for my purposes.
This script retains the form_of key from Wiktextract, and tries to maintain label and gloss from the original entries, including for lower quality entries. There are code comments explaining some of the choices made.
Most words have only one entry but ~50k have several. Some of these correspond to multiple entries in the Wiktextract data (usually for unrelated etymologies or different parts of speech), and some come from entries which have been split up, eg. intersecting inflected forms for separate lemmas, or lemmas which intersect with inflected forms of other lemmas, etc.
word
- the word formpos
- part of speechgender
- optional, a list of genders as strings'f'
or'm'
.- uses both genders for words where both genders are valid
- unfortunately doesn't include a gender for modern gender-neutral terms like 'amigue'
definitions
- list of dictionaries withdefinition
and optionallylabel
andgloss
fields.label
usually includes morphological, syntactic, or dialectological information, andgloss
is a secondary gloss or disambiguation for the definition.- note that
definition
is missing ornull
in a very small number of entries for Spanish.
forms
- optional, list of inflected formsfull_forms
- optional; single-token lemmas with multi-token forms have the correct form extracted, in these cases the original list of forms is in this key.form_of
- optional, gives the lemma of which the word is an inflected form.- most
form_of
entries have in theirdefinition
s the entry's specific inflection and lemma, eg 'second-person singular imperative of fresar'
- most
from_alt_of
- optional, boolean, identifiesform_of
entries which came from sense with analt_of
tag in the Wiktextract data. These are usually alternative or deprecated spellings.from_forms
- optional, boolean, identifiesform_of
entries which were populated from the list of inflected forms on a lemma, and which were absent as entries in the Wiktextract data.- note that these entries have an empty list for
definitions
- note that these entries have an empty list for