The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1

Name: The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1
License: https://creativecommons.org/licenses/by-sa/4.0/

Terčon, Luka; Ljubešić, Nikola

dc.contributor.author	Terčon, Luka
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2023-05-16T06:55:41Z
dc.date.available	2023-05-16T06:55:41Z
dc.date.issued	2023-05-10
dc.identifier.uri	http://hdl.handle.net/11356/1831
dc.description	The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) combined with the Croatian hr500k training dataset (http://hdl.handle.net/11356/1792) to ensure sufficient representation of certain labels. The CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1789) were used during training. The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. This version differs from the previous one in that the SETimes.SR training corpus was expanded with the Croatian hr500k training dataset to ensure sufficient representation of certain labels, and in that it was trained using the new version of the Serbian word embeddings.
dc.language.iso	srp
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	http://dx.doi.org/10.18653/v1/W19-3704
dc.relation.replaces	http://hdl.handle.net/11356/1349
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/clarinsi/classla
dc.subject	language model
dc.subject	part-of-speech tagging
dc.title	The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person	Luka Terčon luka.tercon@gmail.com Faculty of Computer and Information Science, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
files.count	2
files.size	188149388
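
The description above states that the model produces UPOS, FEATS and XPOS (MULTEXT-East) labels for each token. As a minimal sketch of how these three layers can be read from CoNLL-U formatted output, the following Python snippet parses a small sample. The sample sentence and its annotations are hand-written for illustration and are not actual output of this model; the names `Token`, `parse_conllu` and `SAMPLE` are hypothetical and do not belong to CLASSLA-Stanza.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated token with the three label layers named in the description."""
    form: str
    upos: str    # Universal POS tag
    xpos: str    # MULTEXT-East morphosyntactic tag
    feats: dict  # Universal morphological features as key/value pairs


def parse_conllu(text):
    """Extract FORM, UPOS, XPOS and FEATS from CoNLL-U formatted text."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Keep only plain token rows: 10 columns with an integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[5] == "_":
            feats = {}
        else:
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append(Token(form=cols[1], upos=cols[3], xpos=cols[4], feats=feats))
    return tokens


# Hand-written illustrative sample (an assumption, not real model output).
SAMPLE = "\n".join([
    "# text = Ovo je rečenica.",
    "\t".join(["1", "Ovo", "ovaj", "PRON", "Pd-nsn",
               "Case=Nom|Gender=Neut|Number=Sing|PronType=Dem", "_", "_", "_", "_"]),
    "\t".join(["2", "je", "biti", "AUX", "Var3s",
               "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", "_", "_", "_", "_"]),
    "\t".join(["3", "rečenica", "rečenica", "NOUN", "Ncfsn",
               "Case=Nom|Gender=Fem|Number=Sing", "_", "_", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "Z", "_", "_", "_", "_", "_"]),
])

if __name__ == "__main__":
    for tok in parse_conllu(SAMPLE):
        print(tok.form, tok.upos, tok.xpos, tok.feats)
```

The parser reads only the columns that correspond to the three label layers and skips comment lines, multiword-token ranges and empty nodes, so it is a sketch for inspecting annotations rather than a full CoNLL-U reader.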