Serbian linguistic training corpus SETimes.SR 2.0

Name: Serbian linguistic training corpus SETimes.SR 2.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Batanović, Vuk; Ljubešić, Nikola; Samardžić, Tanja; Erjavec, Tomaž


dc.contributor.author	Batanović, Vuk
dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Samardžić, Tanja
dc.contributor.author	Erjavec, Tomaž
dc.date.accessioned	2023-07-22T14:28:21Z
dc.date.available	2023-07-22T14:28:21Z
dc.date.issued	2023-06-13
dc.identifier.uri	http://hdl.handle.net/11356/1843
dc.description	The SETimes.SR training corpus contains around 100,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, http://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf. The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements in the annotations of the corpus. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso	srp
dc.publisher	Regional Linguistic Data Initiative Centre ReLDI
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	http://www.aclweb.org/anthology/W17-1407
dc.relation.replaces	http://hdl.handle.net/11356/1200
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/reldi-data/SETimes.SRPlus
dc.subject	part-of-speech tagging
dc.subject	dependency treebank
dc.subject	parsing
dc.subject	named entities
dc.subject	tokenisation
dc.subject	manual annotation
dc.subject	TEI
dc.title	Serbian linguistic training corpus SETimes.SR 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Swiss National Science Foundation 160501 ReLDI Other
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	176 texts
size.info	4384 sentences
size.info	97673 tokens
files.count	4
files.size	9861171
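
The sentence and token counts listed under size.info can be checked against a local copy of the data. The following is a minimal Python sketch, assuming the corpus is available as a CoNLL-U file (the record mentions the UDv2 Guidelines and TEI but does not state the file formats, so this is an assumption). The function name count_conllu and the toy sample are hypothetical and are not taken from the corpus.

```python
def count_conllu(text):
    """Count sentences and tokens in CoNLL-U formatted text.

    Assumes standard CoNLL-U layout: tab-separated columns, '#' comment
    lines, and a blank line after each sentence.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) carry no tokens.
            continue
        token_id = line.split("\t")[0]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


# Toy sample for illustration only; annotation columns are left as "_".
SAMPLE = (
    "# sent_id = toy-1\n"
    "1\tOvo\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tprimer\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = toy-2\n"
    "1\tZdravo\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\t!\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    n_sentences, n_tokens = count_conllu(SAMPLE)
    print(f"{n_sentences} sentences, {n_tokens} tokens")
```

To run the same check on a real file, read its contents with open(path, encoding="utf-8").read() and pass the resulting string to count_conllu; the path is left to the reader, since the record does not list file names.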