| dc.contributor.author |
Krsnik, Luka |
| dc.contributor.author |
Dobrovoljc, Kaja |
| dc.contributor.author |
Terčon, Luka |
| dc.date.accessioned |
2026-05-26T08:26:46Z |
| dc.date.available |
2026-05-26T08:26:46Z |
| dc.date.issued |
2026-05-25 |
| dc.identifier.uri |
http://hdl.handle.net/11356/2201 |
| dc.description |
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings).
It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
Compared with its counterpart models trained only on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets, this model yields significantly better performance on spoken transcripts and identical, state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type.
To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
Version 1.3 was trained on the same data as version 1.2, except that the spoken SST data (UD v2.15) was augmented with colloquial (non-standardized) transcriptions of spoken Slovenian alongside the standardized ones. The resulting model achieves state-of-the-art performance on both standardized (e.g. "včasih govorimo takole") and colloquial (e.g. "včas govorimo tkole") speech transcriptions, without affecting performance on written data. |
| dc.language.iso |
slv |
| dc.publisher |
Centre for Language Resources and Technologies, University of Ljubljana |
| dc.relation.isreferencedby |
https://doi.org/10.51663/pnz.65.3.01 |
| dc.relation.replaces |
http://hdl.handle.net/11356/1997 |
| dc.rights |
Apache License 2.0 |
| dc.rights.uri |
https://opensource.org/licenses/Apache-2.0 |
| dc.rights.label |
PUB |
| dc.source.uri |
https://github.com/clarinsi/trankit-train |
| dc.subject |
language model |
| dc.subject |
lemmatisation |
| dc.subject |
tokenisation |
| dc.subject |
sentence segmentation |
| dc.subject |
part-of-speech tagging |
| dc.subject |
feature prediction |
| dc.subject |
parsing |
| dc.subject |
dependency parsing |
| dc.subject |
corpus annotation |
| dc.title |
The Trankit model for linguistic processing of written and spoken Slovenian 1.3 |
| dc.type |
toolService |
| metashare.ResourceInfo#ContentInfo.detailedType |
tool |
| metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
| has.files |
yes |
| branding |
CLARIN.SI data & tools |
| contact.person |
Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik |
| contact.person |
Kaja Dobrovoljc Zor kaja.dobrovoljc@ijs.si Jožef Stefan Institute |
| sponsor |
ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| sponsor |
ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds |
| sponsor |
ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
| sponsor |
ARIS (Slovenian Research and Innovation Agency) J6-70213 MAPCASE: Mapping case architectures in Slovenian and across languages nationalFunds |
| files.count |
1 |
| files.size |
152616684 |
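
The description above points to the repository and the Trankit documentation for loading instructions. As a rough illustration only, the sketch below shows one way the unpacked model might be loaded with Trankit's customized-pipeline mechanism and its output written as CoNLL-U. Several things here are assumptions rather than facts from this record: the `customized` category name, the layout of the unpacked directory, and the helper names `load_pipeline`, `to_conllu`, and `MOCK_DOC`. The mock document is hand-written to mimic the shape of Trankit's output dictionary; it is not actual model output. The repository README remains the authoritative source.

```python
from typing import Any, Dict, List


def load_pipeline(model_dir: str, embedding: str = "xlm-roberta-base",
                  category: str = "customized"):
    """Load a customized Trankit pipeline from an unpacked model directory.

    Assumptions: `model_dir` is the folder the ZIP was unpacked into, and the
    models were saved under the `customized` category. Check the repository
    README for the actual layout.
    """
    import trankit  # imported lazily so to_conllu() works without Trankit

    # Ask Trankit to check that the expected model files are present.
    trankit.verify_customized_pipeline(
        category=category, save_dir=model_dir, embedding_name=embedding
    )
    return trankit.Pipeline(lang=category, cache_dir=model_dir,
                            embedding=embedding)


def to_conllu(doc: Dict[str, Any]) -> str:
    """Serialize a Trankit-style output dictionary into CoNLL-U text.

    Missing fields become "_". Multi-word token ranges are not handled.
    """
    lines: List[str] = []
    for sent in doc["sentences"]:
        lines.append(f"# text = {sent['text']}")
        for tok in sent["tokens"]:
            cols = [tok["id"], tok["text"], tok.get("lemma"), tok.get("upos"),
                    tok.get("xpos"), tok.get("feats"), tok.get("head"),
                    tok.get("deprel"), None, None]  # DEPS and MISC left empty
            lines.append("\t".join("_" if c is None else str(c) for c in cols))
        lines.append("")  # blank line terminates each sentence
    return "\n".join(lines)


# Hand-written mock in the shape of Trankit output (NOT real model output).
MOCK_DOC = {
    "sentences": [{
        "id": 1,
        "text": "včasih govorimo takole",
        "tokens": [
            {"id": 1, "text": "včasih", "lemma": "včasih", "upos": "ADV",
             "head": 2, "deprel": "advmod"},
            {"id": 2, "text": "govorimo", "lemma": "govoriti", "upos": "VERB",
             "head": 0, "deprel": "root"},
            {"id": 3, "text": "takole", "lemma": "takole", "upos": "ADV",
             "head": 2, "deprel": "advmod"},
        ],
    }]
}

if __name__ == "__main__":
    print(to_conllu(MOCK_DOC))
```

With Trankit installed and the archive unpacked, usage would look something like `to_conllu(load_pipeline("./models", embedding="xlm-roberta-large")("včas govorimo tkole"))`, where `./models` is a hypothetical path. The `xlm-roberta-large` variant needs considerably more memory than `xlm-roberta-base`, as noted in the description.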