About

This repository contains the code used to retrain and evaluate models based on Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing. With this setup, we developed new Trankit models for Slovenian, trained on a more recent and considerably larger version of the Slovenian UD treebanks than the default Trankit models (which were trained on UD v2.5).

For details on the inner workings and options of the Trankit library, please refer to the original documentation. This repository illustrates how to use the improved models developed during this project.

Published models

The models were trained on successive versions of the SSJ UD treebank of written Slovenian, the SST UD treebank of spoken Slovenian, and a combined dataset incorporating both.

For production use, we recommend the latest model, Trankit SSJ+SST-2.15, which achieves state-of-the-art performance for both written and spoken Slovenian.

Release date	Short name	Training Data	Model (CLARIN.SI repository)
2023-09-29	Trankit-SSJ-2.12	SSJ r2.12	zip
2024-01-17	Trankit_SSJ+SST-2.12	SSJ r2.12 + SST r2.12	zip
2024-08-29	Trankit_SSJ-2.14	SSJ r2.14	zip
2024-12-06	Trankit_SST-2.15	SST r2.15	zip
2024-12-06	Trankit_SSJ+SST-2.15	SSJ r2.14 + SST r2.15	zip --> recommended

Usage example

Below is a step-by-step guide to using our models with the Trankit library.

Step 1: Initialization

from trankit import Pipeline, trankit2conllu

# Initialize trankit
p = Pipeline(lang='customized', cache_dir='<PATH TO DOWNLOADED MODELS>', embedding='xlm-roberta-large')
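
If the path passed as cache_dir is wrong, the problem may only surface after Trankit and its dependencies have been imported. The sketch below wraps the initialization above in a small helper that checks the directory first. The helper name `load_pipeline` is hypothetical; the `Pipeline` arguments are the ones shown above.

```python
from pathlib import Path


def load_pipeline(model_dir, embedding="xlm-roberta-large"):
    """Create a customized Trankit pipeline from an unpacked model directory.

    `load_pipeline` is an illustrative helper, not part of Trankit.
    """
    model_path = Path(model_dir).expanduser()
    if not model_path.is_dir():
        # Fail early with a clear message instead of a late loading error.
        raise FileNotFoundError(f"Model directory not found: {model_path}")

    # Imported lazily so the path check runs before the heavy import.
    from trankit import Pipeline

    return Pipeline(lang="customized", cache_dir=str(model_path), embedding=embedding)
```

Usage would then be `p = load_pipeline('<PATH TO DOWNLOADED MODELS>')`, with the placeholder replaced by the directory holding the unpacked model files.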

Step 2: Process Input

There are two options for processing input:

Option 1 - Using Text Input:

text = 'Example text!'
dict_output = p(text)

Option 2 - Using a Pre-tokenized List:

pretokenized_list = [['Example', 'pre-tokenized', 'list', '!']]
dict_output = p(pretokenized_list)
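
Either option yields a nested Python dictionary. The sketch below shows one way to walk it, assuming the layout described in the Trankit documentation: a top-level `sentences` list whose items each carry a `tokens` list with keys such as `text`, `lemma`, `upos`, `head` and `deprel`. The helper `iter_tokens` and the sample dictionary are hypothetical, hand-written stand-ins, not real model output.

```python
def iter_tokens(dict_output):
    """Yield (sentence_id, form, lemma, upos, head, deprel) for every token.

    Assumes the Trankit output layout; keys that are missing yield None.
    """
    for sentence in dict_output.get("sentences", []):
        for token in sentence.get("tokens", []):
            yield (
                sentence.get("id"),
                token.get("text"),
                token.get("lemma"),
                token.get("upos"),
                token.get("head"),
                token.get("deprel"),
            )


# Hand-written sample mimicking the assumed layout (illustrative annotations).
sample_output = {
    "text": "Example text!",
    "sentences": [
        {
            "id": 1,
            "text": "Example text!",
            "tokens": [
                {"id": 1, "text": "Example", "lemma": "example", "upos": "NOUN", "head": 2, "deprel": "compound"},
                {"id": 2, "text": "text", "lemma": "text", "upos": "NOUN", "head": 0, "deprel": "root"},
                {"id": 3, "text": "!", "lemma": "!", "upos": "PUNCT", "head": 2, "deprel": "punct"},
            ],
        }
    ],
}

for row in iter_tokens(sample_output):
    print(row)
```

With a real pipeline, `iter_tokens(dict_output)` would replace the sample dictionary.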

Step 3: Convert Output to CoNLL-U Format

# Convert output from dictionary to CoNLL-U format
conllu_output = trankit2conllu(dict_output)
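
The converted output is a CoNLL-U string: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line between sentences. The sketch below reads such a string back into per-sentence lists of dictionaries, which can help with quick checks. The function name `parse_conllu` and the sample string are illustrative and hand-written; the column order follows the CoNLL-U specification.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc")


def parse_conllu(conllu_text):
    """Split a CoNLL-U string into sentences, each a list of column dictionaries."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"Expected {len(FIELDS)} columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample, not real model output.
sample_conllu = (
    "# text = Example text!\n"
    "1\tExample\texample\tNOUN\t_\t_\t2\tcompound\t_\t_\n"
    "2\ttext\ttext\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\t!\t!\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for token in parse_conllu(sample_conllu)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

When saving `conllu_output` to disk, opening the file with `encoding="utf-8"` keeps Slovenian characters such as č, š and ž intact.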

Performance

The table below reports lemmatization (Lemmas), tagging (UPOS), full morphological analysis (XPOS) and parsing (LAS) performance on the written SSJ-2.14 test set and the spoken SST-2.15 test set.

Model	Model type	SSJ-2.14-test (written)				SST-2.15-test (spoken)			
		Lemmas	UPOS	XPOS	LAS	Lemmas	UPOS	XPOS	LAS
Trankit-SSJ-2.14	Written	98.07	99.12	98.24	95.48	98.16	95.33	93.93	79.14
Trankit-SST-2.15	Spoken	94.27	97.74	93.74	91.90	97.90	98.79	96.71	86.54
Trankit-SSJ+SST-2.15	Written+Spoken	98.10	99.17	98.27	95.36	98.85	98.97	98.02	87.93
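
To make the comparison concrete, the LAS differences between the combined model and the two single-treebank models can be computed directly from the table. A small sketch with the values copied from the LAS columns above; the names `LAS` and `las_gain` are illustrative.

```python
# LAS scores copied from the table: (written SSJ-2.14 test, spoken SST-2.15 test)
LAS = {
    "Trankit-SSJ-2.14": (95.48, 79.14),
    "Trankit-SST-2.15": (91.90, 86.54),
    "Trankit-SSJ+SST-2.15": (95.36, 87.93),
}


def las_gain(model, baseline):
    """Return the (written, spoken) LAS difference of `model` over `baseline`."""
    written = round(LAS[model][0] - LAS[baseline][0], 2)
    spoken = round(LAS[model][1] - LAS[baseline][1], 2)
    return written, spoken


for baseline in ("Trankit-SSJ-2.14", "Trankit-SST-2.15"):
    written, spoken = las_gain("Trankit-SSJ+SST-2.15", baseline)
    print(f"vs {baseline}: written {written:+.2f}, spoken {spoken:+.2f}")
```

On the spoken test set the combined model gains 8.79 LAS points over the written-only model and 1.39 over the spoken-only model, while on the written test set it trails the written-only model by 0.12 points.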

Acknowledgement

This work was supported by the Slovenian Research and Innovation Agency through the research project SPOT: A Treebank-Driven Approach to the Study of Spoken Slovenian (Z6-4617) and the research programme Language Resources and Technologies for Slovene (P6-0411). Infrastructural support was provided by the Centre for Language Resources and Technologies at the University of Ljubljana (CJVT).
