What's New
toolService
Description:
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://gith ...
This item contains 1 file (138.81
MB).
Publicly Available
toolService
Description:
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian ...
This item contains 1 file (145.51
MB).
Publicly Available
corpus
Description:
The multilingual IPTC Media Topic dataset EMMediaTopic 1.0 is a collection of news articles in Catalan, Croatian, Greek, and Slovenian, automatically annotated with the 17 top-level topic labels from the IPTC NewsCodes ...
This item contains 1 file (71.3
MB).
Publicly Available
Most Viewed Items
Top Last Week
corpus
Description:
ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora ...
This item contains 30 files (5.87
GB).
Publicly Available
corpus
Description:
The Serbian web corpus srWaC was built by crawling the .rs top-level domain in 2014. The corpus was near-deduplicated on paragraph level, normalised via diacritic restoration, morphosyntactically annotated and lemmatised. ...
This item contains 6 files (3.51
GB).
Publicly Available
corpus
Description:
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations ...
This item contains 1 file (14.12
MB).
Academic Use