2026-05-21T18:12:36Z http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/1997 2024-12-06T13:47:25Z hdl_11356_1023 hdl_11356_1024

The Trankit model for linguistic processing of written and spoken Slovenian 1.2
Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka
language model, lemmatisation, tokenisation, sentence segmentation, part-of-speech tagging, feature prediction, parsing, dependency parsing, corpus annotation
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It predicts sentence segmentation, tokenization, lemmatization, and language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields significantly better performance on spoken transcripts and identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. Unlike the previous versions of this model (1.0, 1.1), model 1.2 was trained on the new SST train-dev-test split introduced in UD v2.15.
2024-12-06
toolService
http://hdl.handle.net/11356/1997
slv
https://arxiv.org/pdf/2101.03289.pdf
http://hdl.handle.net/11356/1965
Apache License 2.0
https://opensource.org/licenses/Apache-2.0
PUB
text/plain; charset=utf-8
application/zip
downloadable_files_count: 1
Centre for Language Resources and Technologies, University of Ljubljana
https://github.com/clarinsi/trankit-train
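The loading step that the description points to can be sketched roughly as follows. This is only an illustration under stated assumptions, not the maintainers' official instructions: the helper name `load_slovenian_pipeline` and the `model_dir` path are hypothetical, and the `'customized'` category is assumed on the grounds that the model includes no multi-word token expansion or named entity recognition. The layout expected inside the unzipped directory is described in the GitHub repository linked above.

```python
from pathlib import Path

# The ZIP ships models for these two encoders (per the record description).
SUPPORTED_EMBEDDINGS = ("xlm-roberta-base", "xlm-roberta-large")


def load_slovenian_pipeline(model_dir, embedding="xlm-roberta-large", gpu=True):
    """Load the unzipped model as a customized Trankit pipeline.

    model_dir is a hypothetical path to the directory where the ZIP
    file was extracted.
    """
    if embedding not in SUPPORTED_EMBEDDINGS:
        raise ValueError(
            f"embedding must be one of {SUPPORTED_EMBEDDINGS}, got {embedding!r}"
        )
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"model directory not found: {model_dir}")

    # Imported lazily so the checks above run even without Trankit installed.
    from trankit import Pipeline

    # Assumption: the 'customized' category (no MWT expansion, no NER).
    return Pipeline(
        lang="customized",
        cache_dir=str(model_dir),
        gpu=gpu,
        embedding=embedding,
    )
```

With Trankit installed and the archive extracted, a call such as `p = load_slovenian_pipeline("./models")` followed by `doc = p("To je stavek.")` would run the full pipeline on raw text. Choosing `embedding="xlm-roberta-base"` trades some accuracy for lower hardware requirements, as the description notes.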