dc.contributor.author | Terčon, Luka |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2023-05-16T06:55:41Z |
dc.date.available | 2023-05-16T06:55:41Z |
dc.date.issued | 2023-05-10 |
dc.identifier.uri | http://hdl.handle.net/11356/1831 |
dc.description | The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) combined with the Croatian hr500k training dataset (http://hdl.handle.net/11356/1792) to ensure sufficient representation of certain labels. The CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1789) were used during training. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. Compared with the previous version of the model, this version was trained on the SETimes.SR corpus expanded with the Croatian hr500k training dataset, and it was trained using the new version of the Serbian word embeddings. |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://dx.doi.org/10.18653/v1/W19-3704 |
dc.relation.replaces | http://hdl.handle.net/11356/1349 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/clarinsi/classla |
dc.subject | language model |
dc.subject | part-of-speech tagging |
dc.title | The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1 |
dc.type | toolService |
metashare.ResourceInfo#ContentInfo.detailedType | tool |
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent | true |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
contact.person | Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other |
files.count | 2 |
files.size | 188149388 |
Files in this item

- Name: baseline_pos.zip
- Size: 74.31 MB
- Format: application/zip
- Description: Language model
- MD5: 63630c0d882181c90d35a638813c401f

- Name: sr_set.pretrain.zip
- Size: 105.12 MB
- Format: application/zip
- Description: Pretrained word embeddings
- MD5: a5759144e1114ac72660918eeaad8dc8
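
The MD5 values in the file listing above can be used to check that both archives were downloaded intact. The following is a minimal sketch using only the Python standard library. It assumes the two files were saved under their listed names in a single directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and are not part of CLASSLA-Stanza or the repository.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this item.
EXPECTED_MD5 = {
    "baseline_pos.zip": "63630c0d882181c90d35a638813c401f",
    "sr_set.pretrain.zip": "a5759144e1114ac72660918eeaad8dc8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the archives were saved in the current working directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low for the larger archive; a file that is absent is reported the same way as one whose checksum differs.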