Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0

Name: Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Erjavec, Tomaž; Batanović, Vuk; Miličević, Maja; Samardžić, Tanja

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Batanović, Vuk
dc.contributor.author	Miličević, Maja
dc.contributor.author	Samardžić, Tanja
dc.date.accessioned	2023-04-07T15:31:38Z
dc.date.available	2023-04-07T15:31:38Z
dc.date.issued	2023-04-07
dc.identifier.uri	http://hdl.handle.net/11356/1793
dc.description	ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is intended as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). In this version, various annotation errors have been corrected and the dataset is encoded in the CoNLL-U-Plus format, like other manually annotated linguistic datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso	hrv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	http://dx.doi.org/10.4312/slo2.0.2016.2.156-188
dc.relation.replaces	http://hdl.handle.net/11356/1241
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/reldi-data/reldi-normtagner-hr
dc.subject	computer-mediated communication
dc.subject	tokenisation
dc.subject	word normalisation
dc.subject	part-of-speech tagging
dc.subject	lemmatisation
dc.subject	named entities
dc.subject	manual annotation
dc.subject	TEI
dc.title	Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info	3871 texts
size.info	7939 sentences
size.info	89855 tokens
files.count	4
files.size	8956597
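
Since the description states that the corpus is encoded in the CoNLL-U-Plus format, a minimal reading sketch may be useful. CoNLL-U-Plus files declare their columns in a first-line "# global.columns = ..." comment, separate token fields with tabs, and separate sentences with blank lines. The function name read_conllup, the sample text and the EXAMPLE:NORM column below are illustrative assumptions, not taken from the corpus files; check the actual column list in the header of the distributed files before relying on any field name.

```python
import io


def read_conllup(stream):
    """Yield (comments, tokens) per sentence from a CoNLL-U-Plus stream.

    Each token is a dict keyed by the column names declared in the
    "# global.columns = ..." header line.
    """
    columns = None
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # The header lists the column names, separated by whitespace.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            comments.append(line)
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        # Emit the last sentence if the stream lacks a trailing blank line.
        yield comments, tokens


# Hypothetical sample; EXAMPLE:NORM is a made-up column for illustration.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS EXAMPLE:NORM\n"
    "# text = Volim Zagreb\n"
    "1\tVolim\tvoljeti\tVERB\tVolim\n"
    "2\tZagreb\tZagreb\tPROPN\tZagreb\n"
    "\n"
    "# text = Hvala\n"
    "1\tHvala\thvala\tNOUN\tHvala\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
for comments, tokens in sentences:
    print([tok["FORM"] for tok in tokens])
```

To read a real file, pass an open file object (opened with encoding="utf-8") instead of the io.StringIO wrapper.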