What's New
corpus
Description:
This is a large-scale multilingual benchmark for evaluating metalinguistic knowledge (i.e. explicit knowledge about the structure of languages) in large language models using grammatical features from the World Atlas of ...
This item contains 1 file (1.75
MB).
Publicly Available
corpus
Description:
The Corpus-grounded evaluation dataset for grammatical question answering (GramQA) consists of 13 grammatical questions inspired by WALS, the World Atlas of Language Structures (https://wals.info/), focusing on word order ...
This item contains 1 file (38.04
KB).
Publicly Available
corpus
Description:
The multilingual training dataset for CAP policy topic classification ParlaCAP-train is a collection of parliamentary speeches in 29 European languages, automatically annotated with 21 major policy topic labels from the ...
This item contains 1 file (19.52
MB).
Publicly Available
Most Viewed Items
Top Last Week
corpus
Description:
ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora ...
This item contains 31 files (5.94
GB).
Publicly Available
corpus
Description:
The CLASSLA-web 2.0 collection is a large-scale, comparable set of web corpora covering all seven South Slavic languages: Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian. This second major ...
This item contains 29 files (455.27
GB).
Publicly Available
corpus
Description:
goo300k is a manually annotated reference corpus of historical Slovene. It contains 1,100 pages (about 300,000 tokens) sampled from 89 texts from the period 1584-1899.
Each text contains extensive meta-data and per-page ...
This item contains 2 files (8.9
MB).
Publicly Available