What's New

 corpus 
corpus
Description:
Janes-Tag is a manually annotated corpus of Slovene Computer-Mediated Communication (CMC). It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic ...
 This item contains 7 files (3.13 MB).
 
Publicly Available Distributed under Creative Commons Attribution Required Share Alike
 corpus 
corpus
Description:
ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and ...
 This item contains 3 files (2 MB).
 
Publicly Available Distributed under Creative Commons Attribution Required
 corpus 
corpus
Description:
ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and ...
 This item contains 3 files (2.06 MB).
 
Publicly Available Distributed under Creative Commons Attribution Required

Most Viewed Items

Top Last Week
 lexicalConceptualResource 
lexicalConceptualResource
Description:
A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages. The process and analysis of emoji sentiment ...
 This item contains 3 files (93.95 KB).
 
Publicly Available Distributed under Creative Commons Attribution Required Share Alike
 lexicalConceptualResource 
lexicalConceptualResource
Description:
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the ...
 This item contains 12 files (16.27 MB).
 
Publicly Available Distributed under Creative Commons Attribution Required Share Alike
 corpus 
corpus
Description:
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated ...
 This item contains 15 files (9.21 GB).
 
Publicly Available Distributed under Creative Commons Attribution Required Share Alike