This repository contains the code used to retrain and evaluate models based on Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing. With this setup, we developed new Trankit models for Slovenian, trained on more recent and considerably larger versions of the Slovenian UD treebanks than the default Trankit models, which were trained on UD v2.5.
For details on Trankit's inner workings and options, please refer to the original documentation. This repository illustrates how to use the improved models developed in this project.
The models were trained on the successive versions of the SSJ UD treebank of written Slovenian, the SST UD treebank of spoken Slovenian, and a combined dataset incorporating both.
For production use, we recommend the latest model, Trankit SSJ+SST-2.15, which achieves state-of-the-art performance for both written and spoken Slovenian.
| Release date | Short name | Training Data | Model (CLARIN.SI repository) |
|---|---|---|---|
| 2023-09-29 | Trankit-SSJ-2.12 | SSJ r2.12 | zip |
| 2024-01-17 | Trankit_SSJ+SST-2.12 | SSJ r2.12 + SST r2.12 | zip |
| 2024-08-29 | Trankit_SSJ-2.14 | SSJ r2.14 | zip |
| 2024-12-06 | Trankit_SST-2.15 | SST r2.15 | zip |
| 2024-12-06 | Trankit_SSJ+SST-2.15 | SSJ r2.14 + SST r2.15 | zip (recommended) |
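The models in the table are distributed as zip archives, so they need to be unpacked before Trankit can load them. The following is a minimal sketch using only the Python standard library; the helper name `unpack_model` and both paths are hypothetical placeholders, and the internal layout of the archives is not described here, so it is worth inspecting the extracted directory before using it as `cache_dir`.

```python
import zipfile
from pathlib import Path


def unpack_model(archive_path, target_dir):
    """Extract a downloaded model archive into target_dir and return that directory.

    Hypothetical helper for illustration; it makes no assumption about
    what the archive contains.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return target
```

A call might look like `unpack_model('<PATH TO DOWNLOADED ZIP>', '<PATH TO DOWNLOADED MODELS>')`, with the second argument later passed to the pipeline as `cache_dir`.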
Below, we provide a step-by-step guide on how to use our models with the trankit tool.

    from trankit import Pipeline, trankit2conllu

    # Initialize trankit
    p = Pipeline(lang='customized', cache_dir='<PATH TO DOWNLOADED MODELS>', embedding='xlm-roberta-large')

There are two options for processing input:

    # Option 1: raw text
    text = 'Example text!'
    dict_output = p(text)

    # Option 2: pre-tokenized input (a list of sentences, each a list of tokens)
    pretokenized_list = [['Example', 'pre-tokenized', 'list', '!']]
    dict_output = p(pretokenized_list)

    # Convert output from dictionary to CoNLL-U format
    conllu_output = trankit2conllu(dict_output)

The table below reports lemmatization (Lemmas), tagging (UPOS), full morphological analysis (XPOS) and parsing (LAS) performance on the written SSJ-2.14 test set and the spoken SST-2.15 test set.

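| Model | Model type | SSJ-2.14-test (written): Lemmas | SSJ-2.14-test (written): UPOS | SSJ-2.14-test (written): XPOS | SSJ-2.14-test (written): LAS | SST-2.15-test (spoken): Lemmas | SST-2.15-test (spoken): UPOS | SST-2.15-test (spoken): XPOS | SST-2.15-test (spoken): LAS |
|---|---|---|---|---|---|---|---|---|---|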
| Trankit-SSJ-2.14 | Written | 98.07 | 99.12 | 98.24 | 95.48 | 98.16 | 95.33 | 93.93 | 79.14 |
| Trankit-SST-2.15 | Spoken | 94.27 | 97.74 | 93.74 | 91.90 | 97.90 | 98.79 | 96.71 | 86.54 |
| Trankit-SSJ+SST-2.15 | Written+Spoken | 98.10 | 99.17 | 98.27 | 95.36 | 98.85 | 98.97 | 98.02 | 87.93 |
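To make the comparison between models easier to read, the scores can be put into a small data structure and queried. The sketch below copies the numbers from the table above verbatim; the names `SCORES`, `best_model`, and `gain` are introduced here purely for illustration and are not part of this repository.

```python
# Scores copied from the evaluation table (all values are percentages).
METRICS = ("Lemmas", "UPOS", "XPOS", "LAS")

SCORES = {
    "Trankit-SSJ-2.14": {
        "SSJ-2.14-test": (98.07, 99.12, 98.24, 95.48),
        "SST-2.15-test": (98.16, 95.33, 93.93, 79.14),
    },
    "Trankit-SST-2.15": {
        "SSJ-2.14-test": (94.27, 97.74, 93.74, 91.90),
        "SST-2.15-test": (97.90, 98.79, 96.71, 86.54),
    },
    "Trankit-SSJ+SST-2.15": {
        "SSJ-2.14-test": (98.10, 99.17, 98.27, 95.36),
        "SST-2.15-test": (98.85, 98.97, 98.02, 87.93),
    },
}


def best_model(test_set, metric):
    """Return the name of the model with the highest score for one metric."""
    idx = METRICS.index(metric)
    return max(SCORES, key=lambda model: SCORES[model][test_set][idx])


def gain(model, baseline, test_set, metric):
    """Return the score difference (model minus baseline) in percentage points."""
    idx = METRICS.index(metric)
    return round(SCORES[model][test_set][idx] - SCORES[baseline][test_set][idx], 2)


if __name__ == "__main__":
    for test_set in ("SSJ-2.14-test", "SST-2.15-test"):
        for metric in METRICS:
            print(test_set, metric, best_model(test_set, metric))
```

Read this way, the combined model has the highest score on every metric except LAS on the written test set, where the written-only model leads by 0.12 points. On the spoken test set, the combined model improves LAS by 1.39 points over the spoken-only model.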
This work was supported by the Slovenian Research and Innovation Agency through the research project SPOT: A Treebank-Driven Approach to the Study of Spoken Slovenian (Z6-4617) and the research programme Language Resources and Technologies for Slovene (P6-0411). Infrastructural support was provided by the Centre for Language Resources and Technologies at the University of Ljubljana (CJVT).