<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
<title>CLARIN.SI data &amp; tools</title>
<link href="http://hdl.handle.net/11356/1024" rel="alternate"/>
<subtitle>CLARIN.SI repository language resources and tools</subtitle>
<id>http://hdl.handle.net/11356/1024</id>
<updated>2026-05-07T00:55:27Z</updated>
<dc:date>2026-05-07T00:55:27Z</dc:date>
<entry>
<title>Monitor corpus of Slovene Trendi 2026-04</title>
<link href="http://hdl.handle.net/11356/2154" rel="alternate"/>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Ponikvar, Primož</name>
</author>
<author>
<name>Šinkec, Mihael</name>
</author>
<author>
<name>Krek, Simon</name>
</author>
<id>http://hdl.handle.net/11356/2154</id>
<updated>2026-05-06T15:17:52Z</updated>
<published>2026-05-06T00:00:00Z</published>
<summary type="text">Monitor corpus of Slovene Trendi 2026-04
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 61 publishers. Trendi 2026-04 covers the period from January 2019 to April 2026, complementing the Gigafida 2.2 reference corpus of written Slovene (http://hdl.handle.net/11356/2106).&#13;
&#13;
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).&#13;
&#13;
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).&#13;
&#13;
The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si).&#13;
&#13;
This version adds texts from April 2026.
</summary>
<dc:date>2026-05-06T00:00:00Z</dc:date>
</entry>
<entry>
<title>Disasters corpus in classical Arabic sources DiCCAS</title>
<link href="http://hdl.handle.net/11356/2097" rel="alternate"/>
<author>
<name>Cicola, Ilaria</name>
</author>
<author>
<name>Pannitto, Ludovica</name>
</author>
<author>
<name>Peta, Ines</name>
</author>
<author>
<name>Fontana, Chiara</name>
</author>
<author>
<name>Norozi, Nahid</name>
</author>
<author>
<name>Demichelis, Marco</name>
</author>
<author>
<name>Aiello, Giulia</name>
</author>
<id>http://hdl.handle.net/11356/2097</id>
<updated>2026-04-21T14:16:23Z</updated>
<published>2026-03-12T00:00:00Z</published>
<summary type="text">Disasters corpus in classical Arabic sources DiCCAS
Cicola, Ilaria; Pannitto, Ludovica; Peta, Ines; Fontana, Chiara; Norozi, Nahid; Demichelis, Marco; Aiello, Giulia
The Disasters corpus in classical Arabic sources (DiCCAS) is designed to allow historians to compare different accounts and narratives of disasters in a variety of classical Arabic sources. &#13;
&#13;
The corpus encompasses a diverse range of materials, including the Qur’an and the ḥadīth collections Saḥīḥ al-Bukhārī and Saḥīḥ Muslim, as well as several significant historical works, such as al-Ṭabarī’s Kitāb Tārīkh al-rusul wa-l-mulūk and Ibn Taghrībirdī’s Kitāb al-Nujūm al-zāhira fī mulūk Miṣr wa-l-Qāhira. The corpus also incorporates adab texts by al-Jāhiẓ, notably his Rasāʾil, and Ibn al-Jawzī’s al-Mudhish. &#13;
&#13;
The DiCCAS corpus is encoded according to the Text Encoding Initiative (TEI) Guidelines; the encoding captures the structure of the corpus and marks disaster-related words. The corpus is also available in vertical format, which adds linguistic annotations, i.e. tokenisation, lemmatisation and PoS tagging.
</summary>
<dc:date>2026-03-12T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0</title>
<link href="http://hdl.handle.net/11356/2116" rel="alternate"/>
<author>
<name>Knez, Timotej</name>
</author>
<author>
<name>Žitnik, Slavko</name>
</author>
<id>http://hdl.handle.net/11356/2116</id>
<updated>2026-04-14T13:07:06Z</updated>
<published>2026-04-14T00:00:00Z</published>
<summary type="text">Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
Knez, Timotej; Žitnik, Slavko
The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants.&#13;
&#13;
The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios.&#13;
&#13;
Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon.
</summary>
<dc:date>2026-04-14T00:00:00Z</dc:date>
</entry>
<entry>
<title>Monitor corpus of Slovene Trendi 2026-03</title>
<link href="http://hdl.handle.net/11356/2103" rel="alternate"/>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Ponikvar, Primož</name>
</author>
<author>
<name>Šinkec, Mihael</name>
</author>
<author>
<name>Krek, Simon</name>
</author>
<id>http://hdl.handle.net/11356/2103</id>
<updated>2026-05-06T15:18:14Z</updated>
<published>2026-04-02T00:00:00Z</published>
<summary type="text">Monitor corpus of Slovene Trendi 2026-03
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 60 publishers. Trendi 2026-03 covers the period from January 2019 to March 2026, complementing the Gigafida 2.2 reference corpus of written Slovene (http://hdl.handle.net/11356/2106).&#13;
&#13;
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).&#13;
&#13;
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).&#13;
&#13;
The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si).&#13;
&#13;
This version adds texts from March 2026.
</summary>
<dc:date>2026-04-02T00:00:00Z</dc:date>
</entry>
<entry>
<title>Corpus of written standard Slovene Gigafida 2.2</title>
<link href="http://hdl.handle.net/11356/2106" rel="alternate"/>
<author>
<name>Krek, Simon</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Repar, Andraž</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Arhar Holdt, Špela</name>
</author>
<author>
<name>Gantar, Polona</name>
</author>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Robnik-Šikonja, Marko</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Laskowski, Cyprian</name>
</author>
<author>
<name>Grčar, Miha</name>
</author>
<author>
<name>Holozan, Peter</name>
</author>
<author>
<name>Šuster, Simon</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Stabej, Marko</name>
</author>
<author>
<name>Logar, Nataša</name>
</author>
<author>
<name>Terčon, Luka</name>
</author>
<author>
<name>Škvorc, Tadej</name>
</author>
<id>http://hdl.handle.net/11356/2106</id>
<updated>2026-04-02T14:51:00Z</updated>
<published>2025-12-08T00:00:00Z</published>
<summary type="text">Corpus of written standard Slovene Gigafida 2.2
Krek, Simon; Erjavec, Tomaž; Repar, Andraž; Čibej, Jaka; Arhar Holdt, Špela; Gantar, Polona; Kosem, Iztok; Robnik-Šikonja, Marko; Ljubešić, Nikola; Dobrovoljc, Kaja; Laskowski, Cyprian; Grčar, Miha; Holozan, Peter; Šuster, Simon; Gorjanc, Vojko; Stabej, Marko; Logar, Nataša; Terčon, Luka; Škvorc, Tadej
Gigafida 2.2 is a reference corpus of written Slovene texts published in the period 1990-2018. It is comprised of daily news, magazines, a selection of web texts (a certain portion of which covers news texts as well), and different types of publications (fiction, school books, and non-fiction). The texts have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene.&#13;
&#13;
The main novelty of version 2.2 is the segmentation of texts from the newspapers Delo and Dnevnik, which represent the largest share of newspaper texts in the corpus. In version 2.1, these texts contained entire daily editions of newspapers with articles on various topics. Using a combination of automatic and manual methods, we segmented the editions into individual articles. In this way, the Gigafida corpus is also better prepared for future upgrades, as newer journalistic texts (e.g., the ones collected for the monitor corpus Trendi) are already being collected in the form of individual articles. A few other improvements have been made, e.g. fixing invalid tags in several files, which had caused the affected texts to be excluded from the corpus when uploaded to the concordancers. On the other hand, Gigafida 2.2 does not contain semantic role labels or named entity annotations.&#13;
&#13;
References:&#13;
Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem and Kaja Dobrovoljc. Gigafida 2.0: The Reference Corpus of Written Standard Slovene. Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.409/&#13;
&#13;
LOGAR BERGINC, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela and KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542
</summary>
<dc:date>2025-12-08T00:00:00Z</dc:date>
</entry>
<entry>
<title>Corpus of written standard Slovene Gigafida 2.1</title>
<link href="http://hdl.handle.net/11356/2055" rel="alternate"/>
<author>
<name>Krek, Simon</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Repar, Andraž</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Arhar Holdt, Špela</name>
</author>
<author>
<name>Gantar, Polona</name>
</author>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Robnik-Šikonja, Marko</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Laskowski, Cyprian</name>
</author>
<author>
<name>Grčar, Miha</name>
</author>
<author>
<name>Holozan, Peter</name>
</author>
<author>
<name>Šuster, Simon</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Stabej, Marko</name>
</author>
<author>
<name>Logar, Nataša</name>
</author>
<id>http://hdl.handle.net/11356/2055</id>
<updated>2026-04-02T14:49:57Z</updated>
<published>2023-08-03T00:00:00Z</published>
<summary type="text">Corpus of written standard Slovene Gigafida 2.1
Krek, Simon; Erjavec, Tomaž; Repar, Andraž; Čibej, Jaka; Arhar Holdt, Špela; Gantar, Polona; Kosem, Iztok; Robnik-Šikonja, Marko; Ljubešić, Nikola; Dobrovoljc, Kaja; Laskowski, Cyprian; Grčar, Miha; Holozan, Peter; Šuster, Simon; Gorjanc, Vojko; Stabej, Marko; Logar, Nataša
Gigafida 2.1 is a reference corpus of written Slovene texts published in the period 1990-2018. It is comprised of daily news, magazines, a selection of web texts (a certain portion of which covers news texts as well), and different types of publications (fiction, school books, and non-fiction). The texts have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene.&#13;
&#13;
Version 2.1 contains the same texts as version 2.0, but includes four additional annotation layers: (1) syntactic dependency annotations based on the Universal Dependencies system (https://universaldependencies.org/); (2) syntactic dependency annotations based on the JOS system; (3) semantic role labelling annotations; (4) named entity annotations. Semantic role labels were assigned with "bilateral-srl" (https://github.com/clarinsi/bilateral-srl), named entities with "The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0" (http://hdl.handle.net/11356/1321), and syntactic dependency annotations (both UD and JOS) with "Parser-V3", a predecessor of Stanza (https://pypi.org/project/stanza) and CLASSLA (https://pypi.org/project/classla).&#13;
&#13;
References:&#13;
Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem and Kaja Dobrovoljc. Gigafida 2.0: The Reference Corpus of Written Standard Slovene. Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.409/&#13;
&#13;
LOGAR BERGINC, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela and KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542
</summary>
<dc:date>2023-08-03T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovenian legal natural language inference dataset SLawNLI</title>
<link href="http://hdl.handle.net/11356/2100" rel="alternate"/>
<author>
<name>Malenšek, Miha</name>
</author>
<author>
<name>Krajnc, Saša</name>
</author>
<author>
<name>Križnar, Primož</name>
</author>
<author>
<name>Završnik, Aleš</name>
</author>
<author>
<name>Bajec, Marko</name>
</author>
<author>
<name>Žitnik, Slavko</name>
</author>
<id>http://hdl.handle.net/11356/2100</id>
<updated>2026-03-20T12:19:46Z</updated>
<published>2026-03-19T00:00:00Z</published>
<summary type="text">Slovenian legal natural language inference dataset SLawNLI
Malenšek, Miha; Krajnc, Saša; Križnar, Primož; Završnik, Aleš; Bajec, Marko; Žitnik, Slavko
SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI schema (premise, hypothesis, label). The dataset was annotated by four master's students of the Faculty of Law. All examples were hand-validated by a researcher from the Institute of Criminology and a practicing lawyer.&#13;
&#13;
The dataset is derived from four Slovenian laws:&#13;
&#13;
- Kazenski zakonik (KZ-1) — Criminal Code (https://pisrs.si/pregledPredpisa?id=ZAKO5050)&#13;
- Stvarnopravni zakonik (SPZ) — Law of Property Code (https://pisrs.si/pregledPredpisa?id=ZAKO3242)&#13;
- Zakon o varstvu osebnih podatkov (ZVOP-2) — Personal Data Protection Act (https://pisrs.si/pregledPredpisa?id=ZAKO7959)&#13;
- Obligacijski zakonik (OZ) — Obligations Code (https://pisrs.si/pregledPredpisa?id=ZAKO1263)&#13;
&#13;
The dataset is provided in JSONL format.
</summary>
<dc:date>2026-03-19T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovenian translation corpus Spook 1.1</title>
<link href="http://hdl.handle.net/11356/2077" rel="alternate"/>
<author>
<name>Vintar, Špela</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Fišer, Darja</name>
</author>
<author>
<name>Mezeg, Adriana</name>
</author>
<id>http://hdl.handle.net/11356/2077</id>
<updated>2026-03-11T11:48:05Z</updated>
<published>2026-03-10T00:00:00Z</published>
<summary type="text">Slovenian translation corpus Spook 1.1
Vintar, Špela; Gorjanc, Vojko; Erjavec, Tomaž; Fišer, Darja; Mezeg, Adriana
The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The first type comprises foreign language texts in French, English, German, and Italian. The second type consists of the corresponding texts in Slovenian. These two types of texts are aligned on the sentence level and comparable in terms of genre and time of publication. The third type consists of original Slovenian texts and is comparable to the Slovenian part of the parallel corpora.&#13;
The transcription of the texts and paragraph-level alignment of the originals/translations were performed manually.&#13;
&#13;
The texts were automatically tokenised, sentence segmented, PoS tagged and lemmatised in 2012. Linguistic processing of the Slovenian texts was performed by ToTaLe (which used TnT for PoS tagging and CLOG for lemmatisation), while the German, English, French and Italian texts were analysed by TreeTagger. The PoS tags in the corpus are given in two variants. The first variant is the tag as output by the tagger: for Slovenian, this is the MULTEXT-East tagset (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), while for the other languages it is the respective TreeTagger tagset. The second variant is a mapping of the original tags to the Spook tagset (https://nl.ijs.si/spook/msd/html-en/).&#13;
&#13;
Version 1.0 was released in the scope of the project in 2021 but was available only to project participants. This version updates the TEI encoding of the corpus and changes the vertical files so that they also include the SPOOK tags as attribute-value pairs. It also removes the parallel fiction part of the corpus (2 x 35 texts) due to copyright considerations. Note, however, that these texts are included in the concordancer-mounted corpus.
</summary>
<dc:date>2026-03-10T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovene morphological segmentation and word formation dataset KOBOS</title>
<link href="http://hdl.handle.net/11356/2060" rel="alternate"/>
<author>
<name>Pranjić, Marko</name>
</author>
<author>
<name>Kern, Boris</name>
</author>
<author>
<name>Voršič, Ines</name>
</author>
<author>
<name>Pollak, Senja</name>
</author>
<id>http://hdl.handle.net/11356/2060</id>
<updated>2026-03-09T15:54:40Z</updated>
<published>2026-03-13T00:00:00Z</published>
<summary type="text">Slovene morphological segmentation and word formation dataset KOBOS
Pranjić, Marko; Kern, Boris; Voršič, Ines; Pollak, Senja
This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in the dataset were sampled from Sloleks 3.0 to provide data for morphological analysis, computational modeling, and linguistic research.&#13;
The dataset is formatted as a lexicon (.tsv) containing five columns:&#13;
1. word: the target word&#13;
2. part_of_speech: the part-of-speech tag (noun, verb, adjective, adverb, or particle)&#13;
3. morphological_segments: all surface-level morphemes&#13;
4. word_formation_segments: derivational morphemes only&#13;
5. simplex: the base word(s)&#13;
&#13;
The dataset captures three distinct dimensions of morphological analysis, which are defined as follows:&#13;
&#13;
Morphological segments (the 'morphological_segments' column) identify all surface-level morphemes in a word, including both derivational and inflectional affixes. This segmentation describes how a word is modified to fit its grammatical role (such as encoding case, gender, and number).&#13;
&#13;
Word formation segments (the 'word_formation_segments' column) focus exclusively on the derivational processes used to create new words. Because inflectional morphology is a separate process that only modifies existing words, inflectional endings are excluded from word formation segments. For example, the adjective "nepozidan" ('not built-up') has the morphological segmentation "ne-po-zid-a-n-0" (capturing the inflectional state), whereas its word formation segmentation is "ne-po-zida-n", reflecting its specific derivational chain (zidati -&gt; pozidati -&gt; pozidan -&gt; nepozidan).&#13;
&#13;
Zero-morphemes are integrated directly into both segmentation columns (represented by the character "0"). A zero-morpheme represents a morpheme without a phonetic form that is used to mark grammatical distinctions not explicitly realized in speech. It can function as both an inflectional morpheme (e.g., marking nominative masculine nouns that lack an explicit suffix) and a word formation morpheme necessary for deriving a specific part of speech from a base word.&#13;
&#13;
Simplex (the 'simplex' column) represents the corresponding absolute base word(s) that have not been formed through any word formation process. A simplex cannot be further divided into two or more word formation morphemes. For example, the participle "leteč" ('flying') has the simplex "leteti" ('to fly') rather than the noun "let" ('flight'). In cases of compound words, the simplex column contains multiple base words separated by a comma (e.g., the adjective "trikolesen" ('three-wheeled') has the simplexes "tri, kolo").&#13;
&#13;
The annotations achieved high inter-annotator agreement (86.80% Krippendorff's Alpha for morphological segmentation, and 85.16% for word formation segments). This is the first publicly available Slovene dataset combining morphological segmentation, word formation segmentation, zero-morphemes, and simplex annotations in a single resource.
</summary>
<dc:date>2026-03-13T00:00:00Z</dc:date>
</entry>
<entry>
<title>Training corpus of spoken Slovenian ROG 1.1</title>
<link href="http://hdl.handle.net/11356/2062" rel="alternate"/>
<author>
<name>Verdonik, Darinka</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Rupnik, Peter</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Majhenič, Simona</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Schmidt, Thomas</name>
</author>
<author>
<name>Vidinić, Jasna</name>
</author>
<id>http://hdl.handle.net/11356/2062</id>
<updated>2026-03-04T12:48:26Z</updated>
<published>2026-03-04T00:00:00Z</published>
<summary type="text">Training corpus of spoken Slovenian ROG 1.1
Verdonik, Darinka; Dobrovoljc, Kaja; Rupnik, Peter; Ljubešić, Nikola; Majhenič, Simona; Čibej, Jaka; Schmidt, Thomas; Vidinić, Jasna
Training corpus of spoken Slovenian ROG 1.1 is an improved version of the ROG 1.0 corpus (http://hdl.handle.net/11356/1992). The main differences between the original and the current version are:&#13;
- Manually corrected Prosodic Unit annotations in ROG-Art&#13;
- Release of ROG-Art in ISO TEI format&#13;
- Omission of TextGrid files&#13;
&#13;
The current version preserves the extent of the data and its composition:&#13;
&#13;
1. ROG-SST, which includes selected Gos 2.1 (http://hdl.handle.net/11356/1863) transcriptions with: &#13;
- manually assigned lemmas and morphosyntactic tags according to the MULTEXT-East annotation scheme (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), &#13;
- manual annotations according to the Universal Dependencies annotation scheme (i.e. part-of-speech categories, morphological features and syntactic dependencies)&#13;
&#13;
In total, ROG-SST spans 76341 words and 6108 sentences. ROG-SST is distributed in the CoNLL-U format (2014-2024) (.conllu files). Project website: https://spot.ff.uni-lj.si/en/.&#13;
&#13;
2. ROG-Art, which includes: &#13;
- all the annotation layers from the ROG-SST &#13;
- prosodic units annotations &#13;
- disfluencies annotation &#13;
- dialogue acts annotation&#13;
&#13;
ROG-Art is distributed as:&#13;
- EXMARaLDA format (.EXB files) for viewing with Partitur Editor (https://www.exmaralda.org/)&#13;
- .EXS files and Rog-Art.coma file for searching through the annotated corpus in the EXMARaLDA EXAKT concordancer (https://www.exmaralda.org/)&#13;
- .TRS files for viewing the transcriptions without annotations with Transcriber (https://trans.sourceforge.net/en/presentation.php)&#13;
- ISO TEI files for cross-platform compatibility.&#13;
&#13;
ROG-Art consists of 39001 words in 1969 sentences. WAV files are only available for the ROG-Art part. They must be copied to the WAV folder of the ROG-Art folder structure to enable automatic opening of WAV files in the EXMARaLDA or Transcriber tools. The WAV recordings are single-channel, sampled at 44100 Hz with 16-bit precision.
</summary>
<dc:date>2026-03-04T00:00:00Z</dc:date>
</entry>
</feed>
