Show simple item record

 
dc.contributor.author Terčon, Luka
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2023-05-16T06:55:41Z
dc.date.available 2023-05-16T06:55:41Z
dc.date.issued 2023-05-10
dc.identifier.uri http://hdl.handle.net/11356/1831
dc.description The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) combined with the Croatian hr500k training dataset (http://hdl.handle.net/11356/1792) to ensure sufficient representation of certain labels. The CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1789) were used during training. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. Unlike the previous version of the model, this version was trained on the SETimes.SR corpus expanded with the Croatian hr500k training dataset to ensure sufficient representation of certain labels. It was also trained using the new version of the Serbian word embeddings.
dc.language.iso srp
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://dx.doi.org/10.18653/v1/W19-3704
dc.relation.replaces http://hdl.handle.net/11356/1349
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/clarinsi/classla
dc.subject language model
dc.subject part-of-speech tagging
dc.title The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1
dc.type toolService
metashare.ResourceInfo#ContentInfo.detailedType tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
files.count 2
files.size 188149388
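
The description states that the model emits UPOS, FEATS and XPOS (MULTEXT-East) labels together. The following is a minimal sketch of how these three labels could be read back from one token line, assuming the annotations are exported in the standard 10-column CoNLL-U format. The sample line is hand-written for illustration and is not actual model output; `parse_token_line` is a hypothetical helper, not part of CLASSLA-Stanza.

```python
# Hand-written illustrative CoNLL-U token line (NOT real model output).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE_LINE = "1\tkuća\tkuća\tNOUN\tNcfsn\tCase=Nom|Gender=Fem|Number=Sing\t_\t_\t_\t_"


def parse_token_line(line):
    """Extract the form and the three morphosyntactic labels from one token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    # FEATS is a '|'-separated list of Name=Value pairs, or '_' when empty.
    feats = {}
    if cols[5] != "_":
        feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
    return {"form": cols[1], "upos": cols[3], "xpos": cols[4], "feats": feats}


if __name__ == "__main__":
    print(parse_token_line(SAMPLE_LINE))
```

In this sketch the XPOS column carries the MULTEXT-East tag, while UPOS and FEATS carry the Universal Dependencies labels for the same token.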


 Files in this item

 Download all files in the item (179.43 MB)
Name baseline_pos.zip
Size 74.31 MB
Format application/zip
Description Language model
MD5 63630c0d882181c90d35a638813c401f
File preview
    • baseline_pos (80 MB)
Name sr_set.pretrain.zip
Size 105.12 MB
Format application/zip
Description Pretrained word embeddings
MD5 a5759144e1114ac72660918eeaad8dc8
File preview
    • sr_set.pretrain.pt (149 MB)
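
The MD5 values listed for the two archives can be used to check that a download completed intact. The following is a minimal sketch using only the Python standard library. It assumes the archives were saved locally under their listed names; `md5_of_file` and `verify_download` are hypothetical helper names introduced here for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "baseline_pos.zip": "63630c0d882181c90d35a638813c401f",
    "sr_set.pretrain.zip": "a5759144e1114ac72660918eeaad8dc8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumes both archives sit in the current directory.
    for name in EXPECTED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify_download(name) else "MISMATCH")
        else:
            print(name, "not found")
```

Reading in chunks keeps memory use flat, which matters for archives of roughly 100 MB such as these.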
