The SPOT-Trankit model for linguistic processing of written and spoken Slovenian 1.3

Name: The SPOT-Trankit model for linguistic processing of written and spoken Slovenian 1.3
License: https://opensource.org/licenses/Apache-2.0

Krsnik, Luka; Dobrovoljc Zor, Kaja; Terčon, Luka

dc.contributor.author	Krsnik, Luka
dc.contributor.author	Dobrovoljc Zor, Kaja
dc.contributor.author	Terčon, Luka
dc.date.accessioned	2026-05-26T08:26:46Z
dc.date.available	2026-05-26T08:26:46Z
dc.date.issued	2026-05-25
dc.identifier.uri	http://hdl.handle.net/11356/2201
dc.description	This is a retrained Slovenian model for Trankit v1.1.1, a library for multilingual natural language processing (https://pypi.org/project/trankit/). It was trained on the concatenation of the SSJ UD treebank of written Slovenian (fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (transcriptions of spontaneous speech in various settings). It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), and universal part-of-speech tags, morphological features and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Compared with its counterpart models trained only on SSJ (http://hdl.handle.net/11356/1963) or only on SST, this model performs significantly better on spoken transcripts and matches the state-of-the-art performance on written texts. It can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. Version 1.3 was trained on the same data as version 1.2, except that the spoken SST data (UD v2.15) was augmented with colloquial (non-standardized) transcriptions of spoken Slovenian alongside the standardized ones. The resulting model achieves state-of-the-art performance on both standardized (e.g. "včasih govorimo takole") and colloquial (e.g. "včas govorimo tkole") speech transcriptions, without affecting performance on written data.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://doi.org/10.51663/pnz.65.3.01
dc.relation.replaces	http://hdl.handle.net/11356/1997
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/trankit-train
dc.subject	language model
dc.subject	lemmatisation
dc.subject	tokenisation
dc.subject	sentence segmentation
dc.subject	part-of-speech tagging
dc.subject	feature prediction
dc.subject	parsing
dc.subject	dependency parsing
dc.subject	corpus annotation
dc.title	The SPOT-Trankit model for linguistic processing of written and spoken Slovenian 1.3
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person	Kaja Dobrovoljc Zor kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) J6-70213 MAPCASE: Mapping case architectures in Slovenian and across languages nationalFunds
files.count	1
files.size	152616684
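
The annotation layers listed in dc.description (lemma, universal part-of-speech tag, MULTEXT-East tag, morphological features, dependency head and relation) correspond to columns of the CoNLL-U format used by Universal Dependencies. The following is a minimal sketch of that mapping, not code from the authors. It assumes the output shape described in the Trankit documentation: a dict with a 'sentences' list, where each sentence holds a 'tokens' list of dicts with keys such as 'lemma', 'upos', 'xpos', 'feats', 'head' and 'deprel'. The function name to_conllu is hypothetical.

```python
def to_conllu(doc):
    """Render a Trankit-style document dict as CoNLL-U text.

    Assumption: every token has an integer 'id' and a 'text' key;
    multiword tokens are not handled in this sketch.
    """
    lines = []
    for sent in doc["sentences"]:
        lines.append(f"# text = {sent['text']}")
        for tok in sent["tokens"]:
            cols = [
                tok["id"],               # ID
                tok["text"],             # FORM
                tok.get("lemma", "_"),   # LEMMA
                tok.get("upos", "_"),    # UPOS
                tok.get("xpos", "_"),    # XPOS (MULTEXT-East tag for this model)
                tok.get("feats", "_"),   # FEATS
                tok.get("head", "_"),    # HEAD
                tok.get("deprel", "_"),  # DEPREL
                "_",                     # DEPS (not predicted)
                "_",                     # MISC
            ]
            lines.append("\t".join(str(c) for c in cols))
        lines.append("")  # a blank line closes each sentence
    return "".join(line + "\n" for line in lines)
```

Missing layers are written as "_", so the same function can be applied to output that lacks one of the annotation layers.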

Files in this item

This item is Publicly Available and licensed under: Apache License 2.0

Name: trankit-sl-ssj+sststand+sstpog.zip
Size: 145.55 MB
Format: application/zip
Description: Language Model
MD5: ff1f3b86a4996fd5944db14725c602d8
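
The MD5 checksum above and the files.size value in the record (152616684 bytes) can be used to check that the ZIP file was downloaded intact. The following is a minimal sketch using only the Python standard library. The function names md5_of and verify_download are illustrative and not part of any tool mentioned in this record.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 line and files.size).
EXPECTED_MD5 = "ff1f3b86a4996fd5944db14725c602d8"
EXPECTED_SIZE = 152616684  # bytes


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file at `path` has the expected size and MD5 digest."""
    path = Path(path)
    return path.stat().st_size == expected_size and md5_of(path) == expected_md5
```

Calling verify_download("trankit-sl-ssj+sststand+sstpog.zip") on the downloaded file should return True when the file matches this record.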

File Preview

save_dir_ssj+sst-stan+sst-pog
- xlm-roberta-base
  - customized
    - customized_lemmatizer.pt (5 MB)
    - customized.downloaded (1 B)
    - customized.vocabs.json (90 kB)
    - customized.tokenizer.mdl (9 MB)
    - customized.tagger.mdl (23 MB)
- xlm-roberta-large
  - customized
    - customized_lemmatizer.pt (5 MB)
    - customized.downloaded (1 B)
    - customized.vocabs.json (90 kB)
    - customized.tokenizer.mdl (48 MB)
    - customized.tagger.mdl (70 MB)
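
The directory layout above matches what Trankit expects for a customized pipeline: a save directory, a subdirectory named after the embedding, and a 'customized' subdirectory holding the model files. The following is a minimal loading sketch under that assumption. It follows the loading pattern in the Trankit documentation (Pipeline with lang, cache_dir and embedding arguments); the instructions in the clarinsi/trankit-train repository remain the authoritative reference. The helper names missing_files and load_pipeline are hypothetical.

```python
from pathlib import Path

# File names taken from the file preview of the ZIP archive.
EXPECTED_FILES = (
    "customized_lemmatizer.pt",
    "customized.downloaded",
    "customized.vocabs.json",
    "customized.tokenizer.mdl",
    "customized.tagger.mdl",
)


def missing_files(save_dir, embedding="xlm-roberta-large"):
    """List expected model files that are absent for the given embedding."""
    model_dir = Path(save_dir) / embedding / "customized"
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


def load_pipeline(save_dir, embedding="xlm-roberta-large", gpu=True):
    """Load the unpacked model as a Trankit customized pipeline.

    Assumption: lang='customized' is the right category because the
    files are stored under a directory named 'customized'.
    """
    missing = missing_files(save_dir, embedding)
    if missing:
        raise FileNotFoundError(f"missing model files for {embedding}: {missing}")
    from trankit import Pipeline  # imported lazily; requires `pip install trankit`
    return Pipeline(lang="customized", cache_dir=str(save_dir),
                    embedding=embedding, gpu=gpu)
```

After unpacking the ZIP file, load_pipeline("save_dir_ssj+sst-stan+sst-pog") would load the large model, and calling the returned pipeline on a raw string such as "včasih govorimo takole" should return the annotated document. Passing embedding="xlm-roberta-base" selects the smaller model.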
