Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0

Name: Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Erjavec, Tomaž; Batanović, Vuk; Miličević, Maja; Samardžić, Tanja

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Batanović, Vuk
dc.contributor.author	Miličević, Maja
dc.contributor.author	Samardžić, Tanja
dc.date.accessioned	2023-04-07T15:38:52Z
dc.date.available	2023-04-07T15:38:52Z
dc.date.issued	2023-04-07
dc.identifier.uri	http://hdl.handle.net/11356/1794
dc.description	ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version corrects various annotation mistakes and is now encoded in the CoNLL-U-Plus format, like the other linguistic training datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso	srp
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
dc.relation.replaces	http://hdl.handle.net/11356/1240
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/reldi-data/reldi-normtagner-sr
dc.subject	computer-mediated communication
dc.subject	tokenisation
dc.subject	word normalisation
dc.subject	part-of-speech tagging
dc.subject	lemmatisation
dc.subject	named entities
dc.subject	manual annotation
dc.subject	TEI
dc.title	Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info	3748 texts
size.info	6899 sentences
size.info	92271 tokens
files.count	4
files.size	9233057
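
The description states that the corpus is encoded in the CoNLL-U-Plus format. In that format, a file begins with a "# global.columns = ..." comment naming its columns. The following is a minimal Python sketch of reading such a file, not the maintainers' own tooling. The function name read_conllu_plus, the column set (ID FORM LEMMA UPOS) and the two sample sentences are assumptions made up for illustration; they are not taken from the corpus, whose actual columns should be checked in the distributed files.

```python
from typing import Dict, Iterable, Iterator, List

# Invented sample, NOT corpus data; the column set is an assumption.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS\n"
    "# sent_id = example.1\n"
    "1\tDobar\tdobar\tADJ\n"
    "2\tdan\tdan\tNOUN\n"
    "3\t.\t.\tPUNCT\n"
    "\n"
    "# sent_id = example.2\n"
    "1\tKiša\tkiša\tNOUN\n"
    "2\tpada\tpadati\tVERB\n"
    "3\t.\t.\tPUNCT\n"
)


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dicts."""
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the "=" sign.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            # Other comment lines (sentence or tweet metadata) are skipped here.
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header line")
        # Token lines are tab-separated, one value per declared column.
        sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        # Emit the last sentence if the input lacks a trailing blank line.
        yield sentence


sentences = list(read_conllu_plus(SAMPLE.splitlines()))
forms = [token["FORM"] for token in sentences[0]]
```

To read a real file, the same function can be given an open file handle, for example read_conllu_plus(open(path, encoding="utf-8")), where path is whichever file from the download is being read. Skipped comment lines may carry tweet-level metadata, so a fuller reader would keep them.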