Croatian linguistic training corpus hr500k 2.0

Name: Croatian linguistic training corpus hr500k 2.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Samardžić, Tanja

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Samardžić, Tanja
dc.date.accessioned	2023-04-07T15:24:10Z
dc.date.available	2023-04-07T15:24:10Z
dc.date.issued	2023-04-13
dc.identifier.uri	http://hdl.handle.net/11356/1792
dc.description	The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, https://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (4) the PARSEME guidelines for annotating multi-word expressions, https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/, and (5) the semantic role labelling annotation protocol for Slovenian and Croatian, https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf. Unlike the previous version of the dataset, this version is encoded in the conllup format, as are other linguistic training datasets for Croatian and Serbian. The PARSEME multi-word expression annotation layer has also been added, together with numerous corrections of labels on all available levels. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso	hrv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
dc.relation.replaces	http://hdl.handle.net/11356/1183
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://github.com/reldi-data/hr500k
dc.subject	part-of-speech tagging
dc.subject	dependency treebank
dc.subject	parsing
dc.subject	named entities
dc.subject	tokenisation
dc.subject	manual annotation
dc.subject	semantic role labelling
dc.subject	multiword expressions
dc.title	Croatian linguistic training corpus hr500k 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	901 texts
size.info	24763 sentences
size.info	499635 tokens
files.count	7
files.size	52002704
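
The description above states that the corpus is encoded in the conllup format. As a rough illustration of how such a file might be read, the following is a minimal sketch, assuming the general CoNLL-U Plus convention of a leading "# global.columns = ..." comment naming the tab-separated columns, further comment lines starting with "#", and blank lines separating sentences. The function name read_conllup, the column name EXAMPLE:NE and the sample sentence with its tags are all hypothetical and made up for illustration; they are not taken from the hr500k files, whose actual column inventory should be checked in the distributed data.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} token dicts.

    Assumes a '# global.columns = ...' header precedes the first token line.
    """
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the '=' sign.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            # Other comment lines (sentence ids, raw text) are skipped here.
            continue
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
        sentence.append(dict(zip(columns, values)))
    if sentence:
        # Emit the last sentence if the input does not end with a blank line.
        yield sentence


# Hypothetical sample: the columns and annotations are illustrative only.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS EXAMPLE:NE",
    "# sent_id = example-1",
    "\t".join(["1", "Ivan", "Ivan", "PROPN", "Npmsn", "B-per"]),
    "\t".join(["2", "živi", "živjeti", "VERB", "Vmr3s", "O"]),
    "\t".join(["3", "u", "u", "ADP", "Sl", "O"]),
    "\t".join(["4", "Zagrebu", "Zagreb", "PROPN", "Npmsl", "B-loc"]),
    "\t".join(["5", ".", ".", "PUNCT", "Z", "O"]),
    "",
])

sentences = list(read_conllup(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["LEMMA"], token["EXAMPLE:NE"])
```

Reading the column names from the header rather than hard-coding them keeps the sketch independent of which annotation layers a given file actually carries. To read from disk, the same function could be passed an open file handle, for example open(path, encoding="utf-8"), where path is whatever file name the download provides.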