Show simple item record

 
dc.contributor.author Krsnik, Luka
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Terčon, Luka
dc.date.accessioned 2024-09-02T13:27:49Z
dc.date.available 2024-09-02T13:27:49Z
dc.date.issued 2024-09-02
dc.identifier.uri http://hdl.handle.net/11356/1965
dc.description This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields significantly better performance on spoken transcripts and almost identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), thus producing significantly better results for spoken data.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby https://arxiv.org/pdf/2101.03289.pdf
dc.relation.replaces http://hdl.handle.net/11356/1909
dc.relation.replaces http://hdl.handle.net/11356/1966
dc.relation.isreplacedby http://hdl.handle.net/11356/1997
dc.rights Apache License 2.0
dc.rights.uri https://opensource.org/licenses/Apache-2.0
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/trankit-train
dc.subject language model
dc.subject lemmatisation
dc.subject tokenisation
dc.subject sentence segmentation
dc.subject part-of-speech tagging
dc.subject feature prediction
dc.subject parsing
dc.subject dependency parsing
dc.subject corpus annotation
dc.title The Trankit model for linguistic processing of spoken and written Slovenian 1.1
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik
contact.person Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
files.count 1
files.size 152501806


 Files in this item

This item is Publicly Available and licensed under: Apache License 2.0
Name save_dir_ssj2.14+sst2.15-dev.zip
Size 145.44 MB
Format application/zip
Description Language model
MD5 3ca0b9aed6bccf4a245d7bbe7f15c845
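
The size and MD5 values listed for the file can be used to check a download before unpacking it. The following is a minimal sketch, not part of the original record: the function names `md5_of_file` and `verify_download` are hypothetical, and the expected values are copied from the `files.size` and MD5 fields of this entry.

```python
import hashlib
from pathlib import Path

# Values copied from this record (files.size and the MD5 field).
EXPECTED_MD5 = "3ca0b9aed6bccf4a245d7bbe7f15c845"
EXPECTED_SIZE = 152501806  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the MD5 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

Assuming the archive was saved under its listed name in the working directory, `verify_download("save_dir_ssj2.14+sst2.15-dev.zip")` should return `True` for an intact copy. The size is checked first because it is cheap and catches truncated downloads without hashing the whole file.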
