dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Farkaš, Daša |
dc.contributor.author | Klubička, Filip |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Miličević, Maja |
dc.contributor.author | Filko, Matea |
dc.contributor.author | Kranjčić, Denis |
dc.contributor.author | Dujmić, Barbara |
dc.date.accessioned | 2017-04-04T07:59:06Z |
dc.date.available | 2017-04-04T07:59:06Z |
dc.date.issued | 2017-04-04 |
dc.identifier.uri | http://hdl.handle.net/11356/1095 |
dc.description | ReLDI-NormTag-hr 1.0 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). The corpus construction is (partially) described in: MILIČEVIĆ, Maja, LJUBEŠIĆ, Nikola. Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4/2, 2016. ISSN 2335-2736. http://dx.doi.org/10.4312/slo2.0.2016.2.156-188 |
dc.language.iso | hrv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1121 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Croatian Twitter training corpus ReLDI-NormTag-hr 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | Swiss National Science Foundation 160501 ReLDI Other |
sponsor | ARRS (Slovenian Research Agency) J6-6842 JANES: Resources, Tools and Methods for the Research of Nonstandard Internet Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
size.info | 89101 tokens |
size.info | 3871 texts |
files.count | 3 |
files.size | 2161129 |
Files in this item
This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)



- Name: ReLDI-hr.zip
- Size: 1.14 MB
- Format: application/zip
- Description: Corpus in TEI format
- MD5: f5b6623ecfd44fdd2ab33bb8f48f6331
- Archive contents:
- schema
- tei_janes_doc.html (2 MB)
- tei_janes.rng (399 kB)
- tei_janes_schema.xml (2 kB)
- tei_janes.zip (44 kB)
- tei_janes.rnc (188 kB)
- reldi-hr.body.xml (6 MB)
- msd-fslib-bs.xml (82 kB)
- reldi-hr.xml (5 kB)

- Name: ReLDI-hr.vert.zip
- Size: 684.65 KB
- Format: application/zip
- Description: Derived corpus in vertical format
- MD5: a9b392b4bc05ab1187be3cd712b9b6cc

- Name: ReLDI-NormTag-Guidelines.pdf
- Size: 261.5 KB
- Format: application/pdf
- Description: Annotation guidelines (in Serbo-Croatian)
- MD5: 237ec14a7885af2b6d7cd3e3853ec70a
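
The MD5 values listed for the three files above can be used to check that a download completed intact. The following is a minimal sketch, not part of the original record: it assumes the files were saved under their listed names in a single local directory, and the helper names `md5_of` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "ReLDI-hr.zip": "f5b6623ecfd44fdd2ab33bb8f48f6331",
    "ReLDI-hr.vert.zip": "a9b392b4bc05ab1187be3cd712b9b6cc",
    "ReLDI-NormTag-Guidelines.pdf": "237ec14a7885af2b6d7cd3e3853ec70a",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch) or None (not found).

    Assumption: the files keep their listed names inside `directory`.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, outcome in verify_downloads().items():
        print(f"{name}: {labels[outcome]}")
```

Run from the download directory, the script prints one status line per file; a `MISMATCH` suggests a corrupted or incomplete download and the file should be fetched again.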