What's New
corpus
Description:
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-08 covers the period from January 2019 to September 2024, complementing the ...
This item contains no files.
corpus
Description:
GaMS-Instruct-DH is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions. It consists of pairs of prompts and responses, some of which contain an additional context ...
This item contains 1 file (888.96 KB).
Publicly Available
corpus
Description:
GaMS-Instruct-GEN is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions. It consists of pairs of prompts and responses, some of which contain an additional input ...
This item contains 1 file (3.12 MB).
Publicly Available
Most Viewed Items
Top Last Week
corpus
Description:
The Montenegrin web corpus MaCoCu-cnr 1.0 was built by crawling the ".me" internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.c ...
This item contains 2 files (500.14 MB).
Publicly Available
corpus
Description:
ParlaMint 4.1 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora ...
This item contains 30 files (5.87 GB).
Publicly Available
corpus
Description:
The CVET corpus contains 230 texts (around 175 thousand words) of varying length, published in the religious journal "Cvetje z vertov sv. Frančiška" between 1887 and 1916, when the magazine was edited by the linguist Fr. ...
This item contains 4 files (15.02 MB).
Publicly Available