dc.contributor.author Ljubešić, Nikola
dc.contributor.author Samardžić, Tanja
dc.date.accessioned 2023-04-07T15:24:10Z
dc.date.available 2023-04-07T15:24:10Z
dc.date.issued 2023-04-13
dc.identifier.uri http://hdl.handle.net/11356/1792
dc.description The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, https://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (4) the PARSEME guidelines for annotating multi-word expressions, https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/, and (5) the semantic role labelling annotation protocol for Slovenian and Croatian, https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf. Unlike the previous version of the dataset, this version is encoded in the CoNLL-U Plus (conllup) format, as are other linguistic training datasets for Croatian and Serbian. The PARSEME multi-word expression annotation layer was also added, together with numerous label corrections on all available levels. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
dc.language.iso hrv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
dc.relation.replaces http://hdl.handle.net/11356/1183
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/reldi-data/hr500k
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject semantic role labelling
dc.subject multiword expressions
dc.title Croatian linguistic training corpus hr500k 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 901 texts
size.info 24763 sentences
size.info 499635 tokens
files.count 7
files.size 52002704
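
Since the description states that the corpus is distributed in the CoNLL-U Plus format, a minimal reading sketch may be useful. It relies only on the general CoNLL-U Plus convention that the file starts with a "# global.columns = ..." line naming the tab-separated columns. The function name read_conllup, the sample sentence and the EXAMPLE:NE column are invented for illustration; the actual column set of hr500k.conllup should be taken from the header of the file itself.

```python
import io


def read_conllup(stream):
    """Yield (comments, tokens) per sentence; each token is a dict keyed by column name."""
    columns = None
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Header line that names the columns used in the rest of the file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Sentence-level comment such as "# sent_id = ...".
            comments.append(line)
        elif not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            tokens.append(dict(zip(columns, line.split("\t"))))
    if tokens:
        # Last sentence when the stream does not end with a blank line.
        yield comments, tokens


# Invented sample, NOT taken from the corpus; EXAMPLE:NE is a hypothetical column.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS EXAMPLE:NE",
    "# sent_id = demo-1",
    "1\tZagreb\tZagreb\tPROPN\tB-loc",
    "2\tje\tbiti\tAUX\tO",
    "3\tgrad\tgrad\tNOUN\tO",
    "4\t.\t.\tPUNCT\tO",
    "",
])

for comments, tokens in read_conllup(io.StringIO(SAMPLE)):
    print(comments[0], [token["FORM"] for token in tokens])
```

To read the downloaded file instead of the sample, the same function can be given open("hr500k.conllup", encoding="utf-8"), assuming the file sits in the working directory.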


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name hr500k.conllup
Size 41.18 MB
Format Unknown
Description CoNLL-U-Plus dataset
MD5 5387c12c3d318591937601a82410287e

Name hr500k-train.conllu.gz
Size 4.7 MB
Format application/gzip
Description CoNLL-U morphosyntax training dataset
MD5 9862c2524f74023dbfce08445fcaaea3

Name hr500k-dev.conllu.gz
Size 617.71 KB
Format application/gzip
Description CoNLL-U morphosyntax development dataset
MD5 931f619cd81a122fe298ca94a1f56209

Name hr500k-test.conllu.gz
Size 647.91 KB
Format application/gzip
Description CoNLL-U morphosyntax test dataset
MD5 164174dceba5b0d893e8cf0825a3279b

Name hr_set-ud-train.conllu.gz
Size 1.9 MB
Format application/gzip
Description CoNLL-U dependency syntax training dataset
MD5 d36eed3a0254cfaa78a2161228af2199

Name hr_set-ud-dev.conllu.gz
Size 284.19 KB
Format application/gzip
Description CoNLL-U dependency syntax development dataset
MD5 98df55eb4d34113cfd5925d643c1f975

Name hr_set-ud-test.conllu.gz
Size 310.52 KB
Format application/gzip
Description CoNLL-U dependency syntax test dataset
MD5 68a7cb699066f5e113830233ce6c994b
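
The MD5 values listed for the files can be used to check a download for corruption. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and check_downloads are illustrative, and the sketch assumes the files were saved under their listed names in a single directory.

```python
import hashlib
import os

# Checksums copied from the file list of this record.
EXPECTED_MD5 = {
    "hr500k.conllup": "5387c12c3d318591937601a82410287e",
    "hr500k-train.conllu.gz": "9862c2524f74023dbfce08445fcaaea3",
    "hr500k-dev.conllu.gz": "931f619cd81a122fe298ca94a1f56209",
    "hr500k-test.conllu.gz": "164174dceba5b0d893e8cf0825a3279b",
    "hr_set-ud-train.conllu.gz": "d36eed3a0254cfaa78a2161228af2199",
    "hr_set-ud-dev.conllu.gz": "98df55eb4d34113cfd5925d643c1f975",
    "hr_set-ud-test.conllu.gz": "68a7cb699066f5e113830233ce6c994b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, status in check_downloads(".").items():
        label = {True: "OK", False: "MISMATCH", None: "not found"}[status]
        print(f"{name}: {label}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.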
