dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Batanović, Vuk |
dc.contributor.author | Miličević, Maja |
dc.contributor.author | Samardžić, Tanja |
dc.date.accessioned | 2023-04-07T15:38:52Z |
dc.date.available | 2023-04-07T15:38:52Z |
dc.date.issued | 2023-04-07 |
dc.identifier.uri | http://hdl.handle.net/11356/1794 |
dc.description | ReLDI-NormTagNER-sr 3.0 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). This version of the dataset corrects various annotation mistakes and is now encoded in the CoNLL-U-Plus format, as are other linguistic training datasets for Croatian and Serbian. The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade. |
dc.language.iso | srp |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://dx.doi.org/10.4312/slo2.0.2016.2.156-188 |
dc.relation.replaces | http://hdl.handle.net/11356/1240 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://github.com/reldi-data/reldi-normtagner-sr |
dc.source.uri | https://reldi.rs/ |
dc.subject | computer-mediated communication |
dc.subject | tokenisation |
dc.subject | word normalisation |
dc.subject | part-of-speech tagging |
dc.subject | lemmatisation |
dc.subject | named entities |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Serbian Twitter training corpus ReLDI-NormTagNER-sr 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
size.info | 3748 texts |
size.info | 6899 sentences |
size.info | 92271 tokens |
files.count | 4 |
files.size | 9233057 |
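The description above states that the corpus is encoded in CoNLL-U-Plus. In that format the column names are declared in a `# global.columns = ...` comment at the top of the file, so a reader can take them from the header instead of hard-coding them. The following is a minimal sketch under that assumption; the function name `read_conllup` and the sample data are hypothetical, and the columns in the sample are illustrative rather than the actual columns of this dataset.

```python
import io


def read_conllup(stream):
    """Yield sentences as lists of {column: value} dicts from a CoNLL-U-Plus stream."""
    columns = None
    sentence = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Column names are declared once, separated by whitespace.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            # Other comment lines (sentence ids, text, metadata) are skipped here.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Made-up two-token sample; the column names are illustrative only.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS\n"
    "# sent_id = example.1\n"
    "1\tdobar\tdobar\tADJ\n"
    "2\tdan\tdan\tNOUN\n"
    "\n"
)

sentences = list(read_conllup(io.StringIO(SAMPLE)))
print(len(sentences), sentences[0][1]["FORM"])
```

To read the distributed file, the same function could be called on `open("reldi-normtagner-sr.conllup", encoding="utf-8")`, assuming the file has been downloaded to the working directory.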
Files in this item
Name | Size | Format | Description | MD5 |
reldi-normtagner-sr.conllup | 7.75 MB | Unknown | CoNLL-U-Plus dataset | c6f3799e38de1209e6bd6b59e888ec11 |
reldi-normtagner-sr-train.conllu.gz | 868.31 KB | application/gzip | CoNLL-U morphosyntax training dataset | eb8e4339d01d2c301c1a038e59fe74c4 |
reldi-normtagner-sr-dev.conllu.gz | 109.69 KB | application/gzip | CoNLL-U morphosyntax development dataset | a2a03ada14d4a914c70e497e169ea28b |
reldi-normtagner-sr-test.conllu.gz | 107.73 KB | application/gzip | CoNLL-U morphosyntax test dataset | a8010fe58b49afd411a9b53c8fe03346 |
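
The MD5 values in the file listing above can be used to check downloaded copies for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` are hypothetical, and the sketch assumes the four files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "reldi-normtagner-sr.conllup": "c6f3799e38de1209e6bd6b59e888ec11",
    "reldi-normtagner-sr-train.conllu.gz": "eb8e4339d01d2c301c1a038e59fe74c4",
    "reldi-normtagner-sr-dev.conllu.gz": "a2a03ada14d4a914c70e497e169ea28b",
    "reldi-normtagner-sr-test.conllu.gz": "a8010fe58b49afd411a9b53c8fe03346",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (file absent)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == checksum) if path.is_file() else None
    return results
```

Calling `verify(".")` in the download directory returns one entry per file; a `False` value suggests the file should be downloaded again.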