Response date: 2026-05-21T18:50:11Z
Request: http://www.clarin.si/repository/oai/request

Identifier: oai:www.clarin.si:11356/1996
Datestamp: 2024-12-06T13:47:53Z
Sets: hdl_11356_1023, hdl_11356_1024

Title: Trankit model for SST 2.15 1.1
Authors: Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka
Keywords: language model, lemmatisation, tokenisation, sentence segmentation, part-of-speech tagging, feature prediction, parsing, dependency parsing, corpus annotation

Description:
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/). It was trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), which features transcriptions of spontaneous speech in various everyday settings. The model predicts sentence segmentation, tokenization, lemmatization, and language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).

Please note that this model has been published for archiving purposes only. For production use, we recommend the state-of-the-art Trankit model available at http://hdl.handle.net/11356/1965 (v1.2 or newer). That model was trained on both spoken (SST) and written (SSJ) data and performs significantly better than the model in this submission.

Unlike version 1.0, this model was trained on the new train-dev-test split of the SST treebank introduced in release UD v2.15.

Date: 2024-12-06
Type: toolService
Handle: http://hdl.handle.net/11356/1996
Language: slv
Related resources: https://arxiv.org/pdf/2101.03289.pdf, http://hdl.handle.net/11356/1966
License: Apache License 2.0 (https://opensource.org/licenses/Apache-2.0)
Availability: PUB
Formats: text/plain; charset=utf-8, application/zip
downloadable_files_count: 1
Publisher: Centre for Language Resources and Technologies, University of Ljubljana
Project URL: https://github.com/clarinsi/trankit-train