dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Batanović, Vuk |
dc.contributor.author | Miličević, Maja |
dc.contributor.author | Samardžić, Tanja |
dc.date.accessioned | 2023-04-07T15:31:38Z |
dc.date.available | 2023-04-07T15:31:38Z |
dc.date.issued | 2023-04-07 |
dc.identifier.uri | http://hdl.handle.net/11356/1793 |
dc.description | ReLDI-NormTagNER-hr 3.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version corrects various annotation errors and encodes the dataset in the CoNLL-U-Plus format, in line with other manually annotated linguistic datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade. |
dc.language.iso | hrv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://dx.doi.org/10.4312/slo2.0.2016.2.156-188 |
dc.relation.replaces | http://hdl.handle.net/11356/1241 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/reldi-data/reldi-normtagner-hr |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | part-of-speech tagging |
dc.subject | lemmatisation |
dc.subject | named entities |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Croatian Twitter training corpus ReLDI-NormTagNER-hr 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
size.info | 3871 texts |
size.info | 7939 sentences |
size.info | 89855 tokens |
files.count | 4 |
files.size | 8956597 |
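The description states that the corpus is encoded in CoNLL-U-Plus, a format in which the first line declares the column layout as `# global.columns = ...`, comment lines start with `#`, token rows are tab-separated, and sentences are separated by blank lines. A minimal reading sketch under those assumptions follows. The function name `read_conllup` and the toy sample (including its four column names) are hypothetical illustrations, not taken from the dataset; the actual columns are whatever the file's own header declares.

```python
import io


def read_conllup(stream):
    """Yield sentences as lists of {column_name: value} dicts.

    Column names are taken from the '# global.columns' header line,
    so nothing about the dataset's actual layout is hard-coded here.
    """
    columns = None
    sentence = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Header: names are whitespace-separated after the '=' sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence or tweet metadata) are skipped.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Toy input with hypothetical columns, for illustration only.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS\n"
    "# text = toy example\n"
    "1\tbok\tbok\tINTJ\n"
    "2\t!\t!\tPUNCT\n"
    "\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
print(len(sentences), [token["FORM"] for token in sentences[0]])
```

For the real file, the same function could be applied to `open("reldi-normtagner-hr.conllup", encoding="utf-8")`, assuming the file has been downloaded to the working directory.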
Files in this item
Name | Size | Format | Description | MD5 |
reldi-normtagner-hr.conllup | 7.47 MB | Unknown | CoNLL-U-Plus dataset | 16e73df1872f953b9e8be7bb6e48671c |
reldi-normtagner-hr-train.conllu.gz | 875.55 KB | application/gzip | CoNLL-U morphosyntax training dataset | 8461e7854d7443a3e046679f575b6fe1 |
reldi-normtagner-hr-dev.conllu.gz | 109.12 KB | application/gzip | CoNLL-U morphosyntax development dataset | 87220592f73b929a9b5909ea97be6250 |
reldi-normtagner-hr-test.conllu.gz | 110.76 KB | application/gzip | CoNLL-U morphosyntax test dataset | 6df74d4b761900f7369072983dff3f2e |
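
The MD5 values listed for the files above can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the files were saved locally under the names shown in the listing.

```python
import hashlib

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "reldi-normtagner-hr.conllup": "16e73df1872f953b9e8be7bb6e48671c",
    "reldi-normtagner-hr-train.conllu.gz": "8461e7854d7443a3e046679f575b6fe1",
    "reldi-normtagner-hr-dev.conllu.gz": "87220592f73b929a9b5909ea97be6250",
    "reldi-normtagner-hr-test.conllu.gz": "6df74d4b761900f7369072983dff3f2e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_download("reldi-normtagner-hr-dev.conllu.gz", "reldi-normtagner-hr-dev.conllu.gz")` would return `True` for an intact copy in the working directory. MD5 is adequate here for detecting accidental corruption, though it is not a safeguard against deliberate tampering.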