2026-06-14T13:02:07Z http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/1792 2024-11-01T09:11:34Z hdl_11356_1023 hdl_11356_1024

Croatian linguistic training corpus hr500k 2.0
Ljubešić, Nikola; Samardžić, Tanja
part-of-speech tagging dependency treebank parsing named entities tokenisation manual annotation semantic role labelling multiword expressions
The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, https://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, (3) the Janes annotation guidelines for named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, (4) the PARSEME guidelines for annotating multi-word expressions, https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/, and (5) the semantic role labelling annotation protocol for Slovenian and Croatian, https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf. Unlike the previous version, this version of the dataset is encoded in the conllup format, as are other linguistic training datasets for Croatian and Serbian. The PARSEME multi-word expression annotation layer was also added, together with numerous label corrections on all available levels. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.
2023-04-13
corpus
http://hdl.handle.net/11356/1792
hrv
http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
http://hdl.handle.net/11356/1183
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/ PUB
text/plain; charset=utf-8 application/octet-stream application/gzip application/gzip application/gzip application/gzip application/gzip application/gzip
downloadable_files_count: 7
Jožef Stefan Institute
https://github.com/reldi-data/hr500k