dc.contributor.author |
Krsnik, Luka |
dc.contributor.author |
Dobrovoljc, Kaja |
dc.date.accessioned |
2023-09-30T17:47:09Z |
dc.date.available |
2023-09-30T17:47:09Z |
dc.date.issued |
2023-09-29 |
dc.identifier.uri |
http://hdl.handle.net/11356/1870 |
dc.description |
This is a retrained standard Slovenian model for the Trankit v1.1.1 library (https://pypi.org/project/trankit/). It performs sentence segmentation, tokenization, lemmatization, and language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/).
The model was trained on the UD_Slovenian-SSJ treebank as published in Universal Dependencies release 2.12 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Because this training set is larger than the one used for the original Trankit v1.1.1 model, this version yields better results and achieves state-of-the-art parsing performance for Slovenian (https://slobench.cjvt.si/leaderboard/view/11).
To use this model, follow the instructions in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. |
dc.language.iso |
slv |
dc.publisher |
Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreferencedby |
https://arxiv.org/pdf/2101.03289.pdf |
dc.relation.isreplacedby |
http://hdl.handle.net/11356/1963 |
dc.rights |
Apache License 2.0 |
dc.rights.uri |
https://opensource.org/licenses/Apache-2.0 |
dc.rights.label |
PUB |
dc.source.uri |
https://github.com/clarinsi/trankit-train |
dc.subject |
language model |
dc.subject |
lemmatisation |
dc.subject |
tokenisation |
dc.subject |
sentence segmentation |
dc.subject |
part-of-speech tagging |
dc.subject |
feature prediction |
dc.subject |
parsing |
dc.title |
The Trankit model for linguistic processing of standard Slovenian |
dc.type |
toolService |
metashare.ResourceInfo#ContentInfo.detailedType |
tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent |
true |
has.files |
yes |
branding |
CLARIN.SI data & tools |
contact.person |
Luka Krsnik krsnik.luka92@gmail.com Luka Krsnik |
contact.person |
Kaja Dobrovoljc kaja.dobrovoljc@ff.uni-lj.si Faculty of Arts, University of Ljubljana |
sponsor |
ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor |
ARRS (Slovenian Research Agency) Z6-4617 A Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds |
files.count |
1 |
files.size |
149893584 |
Files in this item
This item is Publicly Available and licensed under: Apache License 2.0
- Name: save_dir_ssj.zip
- Size: 142.95 MB
- Format: application/zip
- Description: Language model
- MD5: 82631e6e8d6ccc5d30b648d223d71140
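The size and MD5 values listed in this record can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using only the Python standard library. It assumes the archive was saved under its listed name, save_dir_ssj.zip, in the current working directory. The helper names md5_of_file and verify_download are hypothetical and are not part of Trankit or of the repository.

```python
import hashlib
from pathlib import Path

# Values copied from this record (files.size and the MD5 field).
EXPECTED_MD5 = "82631e6e8d6ccc5d30b648d223d71140"
EXPECTED_SIZE = 149893584  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the MD5 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Assumed local file name; adjust to wherever the archive was saved.
    archive = Path("save_dir_ssj.zip")
    if archive.exists():
        print("OK" if verify_download(archive) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

The size check runs first because it is cheap and catches truncated downloads before the whole file is hashed.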