What's New

corpus
Description:
The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The ...
This item contains 3 files (169.71 MB).
Publicly Available. Distributed under Creative Commons: Attribution Required.
corpus
Description:
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually ...
This item contains 3 files (49.63 MB).
Publicly Available. Distributed under Creative Commons: Attribution Required, Noncommercial, Share Alike.
corpus
Description:
Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching (the use of words from two or more languages within one sentence or utterance), according to the supplied typology. Words in the ...
This item contains 4 files (1.28 MB).
Publicly Available. Distributed under Creative Commons: Attribution Required, Share Alike.

Most Viewed Items

Top Last Week
lexicalConceptualResource
Description:
A lexicon of 751 emoji characters with automatically assigned sentiment. The sentiment is computed from 70,000 tweets, labeled by 83 human annotators in 13 European languages. The process and analysis of emoji sentiment ...
This item contains 3 files (93.95 KB).
Publicly Available. Distributed under Creative Commons: Attribution Required, Share Alike.
corpus
Description:
The Croatian web corpus hrWaC was built by crawling the .hr top-level domain in 2011 and again in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated ...
This item contains 15 files (9.21 GB).
Publicly Available. Distributed under Creative Commons: Attribution Required, Share Alike.
corpus
Description:
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations ...
This item contains 1 file (14.12 MB).
Publicly Available. Distributed under Creative Commons: Attribution Required, Noncommercial, Share Alike.