2026-07-22T02:13:02Z
http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/1909
2026-05-26T08:28:21Z
hdl_11356_1023
hdl_11356_1024

Trankit model for linguistic processing of spoken Slovenian
Krsnik, Luka; Dobrovoljc, Kaja
language model, lemmatisation, tokenisation, sentence segmentation, part-of-speech tagging, feature prediction, parsing

This is a retrained Slovenian spoken language model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).

The model was trained on a combination of two datasets published by Universal Dependencies in release 2.12: the spoken SST treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.12) and the written SSJ treebank (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Evaluated on the spoken SST test set, it achieves an F1 score of 97.78 for lemmas, 97.19 for UPOS, 95.05 for XPOS and 81.26 for LAS, significantly better than the counterpart model trained on written SSJ data only (http://hdl.handle.net/11356/1870).

To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). The ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
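The loading step described above might look roughly like the following Python sketch. It is not taken from the clarinsi/trankit-train instructions, which remain the authoritative reference. It assumes that the ZIP has been unpacked into a local directory (the path `./trankit-sl-spoken` is hypothetical), that the unpacked layout matches what Trankit expects for a customized pipeline, and that the pipeline category is `customized`. The helper `token_rows` is illustrative and assumes Trankit's usual dictionary output with `sentences` and `tokens` keys.

```python
from typing import Any, Dict, List, Tuple


def load_pipeline(model_dir: str,
                  embedding: str = "xlm-roberta-base",
                  category: str = "customized"):
    """Load a customized Trankit pipeline from an unpacked model directory.

    `category` is an assumption; check the repository instructions for the
    value that matches the published models.
    """
    # Imported lazily so that token_rows() can be used without Trankit installed.
    from trankit import Pipeline
    return Pipeline(lang=category, cache_dir=model_dir, embedding=embedding)


def token_rows(doc: Dict[str, Any]) -> List[Tuple[str, str, str, str]]:
    """Flatten a Trankit-style document dict into (form, lemma, UPOS, deprel) rows."""
    rows = []
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            rows.append((
                token.get("text", ""),
                token.get("lemma", ""),
                token.get("upos", ""),
                token.get("deprel", ""),
            ))
    return rows


if __name__ == "__main__":
    # Hypothetical path to the unpacked ZIP; the large model needs more memory.
    nlp = load_pipeline("./trankit-sl-spoken", embedding="xlm-roberta-large")
    for row in token_rows(nlp("Dober dan, kako ste?")):
        print(*row, sep="\t")
```

Switching `embedding` to `xlm-roberta-base` trades some accuracy for lower hardware requirements, as noted in the description.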
2024-01-17
toolService
http://hdl.handle.net/11356/1909
slv
https://arxiv.org/pdf/2101.03289.pdf
http://hdl.handle.net/11356/1965
Apache License 2.0
https://opensource.org/licenses/Apache-2.0
PUB
text/plain; charset=utf-8
application/zip
downloadable_files_count: 1
Centre for Language Resources and Technologies, University of Ljubljana
https://github.com/clarinsi/trankit-train