clarinsi/trankit-train

About

This repository contains the code used to retrain and evaluate models based on Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing. With this setup, we developed new Trankit models for Slovenian, trained on a more recent and considerably larger version of the Slovenian UD treebanks than the default Trankit models (which were trained on UD v2.5).

For details on the inner workings of Trankit and its library options, please refer to the original Trankit documentation. This repository illustrates how to use the improved models developed during this project.

Published models

The models were trained on successive versions of the SSJ UD treebank of written Slovenian, the SST UD treebank of spoken Slovenian, and a combined dataset incorporating both.

For production use, we recommend the latest model, Trankit SSJ+SST-2.15, which achieves state-of-the-art performance for both written and spoken Slovenian.

| Release date | Short name | Training data | Model (CLARIN.SI repository) |
|---|---|---|---|
| 2023-09-29 | Trankit-SSJ-2.12 | SSJ r2.12 | zip |
| 2024-01-17 | Trankit_SSJ+SST-2.12 | SSJ r2.12 + SST r2.12 | zip |
| 2024-08-29 | Trankit_SSJ-2.14 | SSJ r2.14 | zip |
| 2024-12-06 | Trankit_SST-2.15 | SST r2.15 | zip |
| 2024-12-06 | Trankit_SSJ+SST-2.15 | SSJ r2.14 + SST r2.15 | zip (recommended) |

Usage example

Below, we provide a step-by-step guide on how to use our models with the trankit tool.

Step 1: Initialization

from trankit import Pipeline, trankit2conllu

# Initialize trankit
p = Pipeline(lang='customized', cache_dir='<PATH TO DOWNLOADED MODELS>', embedding='xlm-roberta-large')

Step 2: Process Input

There are two options for processing input:

Option 1 - Using Text Input:

text = 'Example text!'
dict_output = p(text)

Option 2 - Using a Pre-tokenized List:

pretokenized_list = [['Example', 'pre-tokenized', 'list', '!']]
dict_output = p(pretokenized_list)
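
Option 2 expects a list of sentences, each given as a list of token strings. As a minimal sketch, the list can be built from text that has already been tokenized elsewhere, assuming one sentence per line and whitespace-separated tokens. The helper name `to_pretokenized` is hypothetical and not part of trankit.

```python
def to_pretokenized(text):
    """Turn whitespace-tokenized text (one sentence per line) into a list of token lists."""
    sentences = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:  # skip blank lines between sentences
            sentences.append(tokens)
    return sentences


# Assumed input format: tokens already separated by spaces, one sentence per line.
pretokenized_list = to_pretokenized("Example pre-tokenized list !\nSecond sentence .")
```

The resulting `pretokenized_list` can then be passed to the pipeline in the same way as the hand-written list above.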

Step 3: Convert Output to CoNLL-U Format

# Convert output from dictionary to CoNLL-U format
conllu_output = trankit2conllu(dict_output)
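
If only a few annotation fields are needed, the dictionary output can also be read directly instead of converting it. The sketch below assumes the layout described in the Trankit documentation (a `sentences` list whose items hold a `tokens` list, with multi-word tokens carrying their words under `expanded`). The helper `token_rows` is hypothetical, and `sample_output` is a hand-written stand-in with illustrative values, not actual model output.

```python
def token_rows(doc):
    """Yield (sentence id, word id, form, lemma, UPOS, head, deprel) for every word."""
    for sent in doc["sentences"]:
        for token in sent["tokens"]:
            # Assumed layout: multi-word tokens list their syntactic words under "expanded".
            for word in token.get("expanded", [token]):
                yield (
                    sent["id"],
                    word["id"],
                    word["text"],
                    word.get("lemma"),
                    word.get("upos"),
                    word.get("head"),
                    word.get("deprel"),
                )


# Hand-written stand-in for dict_output; the annotations are illustrative only.
sample_output = {
    "text": "Example text!",
    "sentences": [
        {
            "id": 1,
            "text": "Example text!",
            "tokens": [
                {"id": 1, "text": "Example", "lemma": "example", "upos": "NOUN", "head": 2, "deprel": "compound"},
                {"id": 2, "text": "text", "lemma": "text", "upos": "NOUN", "head": 0, "deprel": "root"},
                {"id": 3, "text": "!", "lemma": "!", "upos": "PUNCT", "head": 2, "deprel": "punct"},
            ],
        }
    ],
}

rows = list(token_rows(sample_output))
```

Using `.get()` for the annotation fields keeps the helper usable when a field is missing from the output.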

Performance

The table below reports lemmatization (Lemmas), tagging (UPOS), full morphological analysis (XPOS) and parsing (LAS) performance on the written SSJ-2.14 test set and the spoken SST-2.15 test set.

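SSJ columns refer to SSJ-2.14-test (written); SST columns refer to SST-2.15-test (spoken).

| Model | Model type | SSJ Lemmas | SSJ UPOS | SSJ XPOS | SSJ LAS | SST Lemmas | SST UPOS | SST XPOS | SST LAS |
|---|---|---|---|---|---|---|---|---|---|
| Trankit-SSJ-2.14 | Written | 98.07 | 99.12 | 98.24 | 95.48 | 98.16 | 95.33 | 93.93 | 79.14 |
| Trankit-SST-2.15 | Spoken | 94.27 | 97.74 | 93.74 | 91.90 | 97.90 | 98.79 | 96.71 | 86.54 |
| Trankit-SSJ+SST-2.15 | Written+Spoken | 98.10 | 99.17 | 98.27 | 95.36 | 98.85 | 98.97 | 98.02 | 87.93 |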
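
To make the comparison concrete, the short sketch below computes the LAS differences between the combined model and the two single-domain models. The scores are copied from the table above; the helper `las_gain` is hypothetical and only does the subtraction.

```python
# LAS scores copied from the performance table (written = SSJ-2.14-test, spoken = SST-2.15-test).
las = {
    "Trankit-SSJ-2.14": {"written": 95.48, "spoken": 79.14},
    "Trankit-SST-2.15": {"written": 91.90, "spoken": 86.54},
    "Trankit-SSJ+SST-2.15": {"written": 95.36, "spoken": 87.93},
}


def las_gain(model, baseline, test_set):
    """Return the LAS difference (model minus baseline) in percentage points."""
    return round(las[model][test_set] - las[baseline][test_set], 2)


combined = "Trankit-SSJ+SST-2.15"
for baseline in ("Trankit-SSJ-2.14", "Trankit-SST-2.15"):
    for test_set in ("written", "spoken"):
        print(f"{combined} vs {baseline} on {test_set}: {las_gain(combined, baseline, test_set):+.2f}")
```

On these numbers, the combined model is 0.12 points below the written-only model on the written test set and 1.39 points above the spoken-only model on the spoken test set.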

Acknowledgement

This work was supported by the Slovenian Research and Innovation Agency through the research project SPOT: A Treebank-Driven Approach to the Study of Spoken Slovenian (Z6-4617) and the research programme Language Resources and Technologies for Slovene (P6-0411). Infrastructural support was provided by the Centre for Language Resources and Technologies at the University of Ljubljana (CJVT).
