CLARIN.SI data & tools

CLARIN.SI data & tools CLARIN.SI repository language resources and tools http://hdl.handle.net/11356/1024 2026-07-06T13:06:35Z 2026-07-06T13:06:35Z Genus (proximum) in the SSKJ2 dictionary senses Perdih, Andrej Bizjak Končar, Aleksandra Divjak Race, Duša Gabrovšek, Dejan Ježovnik, Janoš Krvina, Domen Ledinek, Nina Michelizza, Mija Mirtič, Tanja Petric Žižić, Špela Sušnik, Miha Trojar, Mitja http://hdl.handle.net/11356/2254 2026-06-30T08:46:36Z 2026-06-22T00:00:00Z

Genus (proximum) in the SSKJ2 dictionary senses Perdih, Andrej; Bizjak Končar, Aleksandra; Divjak Race, Duša; Gabrovšek, Dejan; Ježovnik, Janoš; Krvina, Domen; Ledinek, Nina; Michelizza, Mija; Mirtič, Tanja; Petric Žižić, Špela; Sušnik, Miha; Trojar, Mitja The datasets contain sense–genus combinations from the Dictionary of the Slovenian Standard Language, 2nd Edition (Slovar slovenskega knjižnega jezika, druga, dopolnjena in deloma prenovljena izdaja; https://www.fran.si/133/sskj2-slovar-slovenskega-knjiznega-jezika-2). Genus is defined as a word denoting a broad, general category or superordinate class to which a defined word belongs. In the current version, 48,028 noun senses with 3,985 genera are included. Genera were attributed automatically and manually curated. The first dataset (SSKJ2_headword_genus.xml) is focused on senses. Each dictionary sense contains the following information: headword or subheadword, entry ID, sense ID and one or more genera. The second dataset (SSKJ2_genusGroups.xml) is focused on genera. One or more dictionary senses are attributed to each genus; for each dictionary sense, the following information is provided: headword or subheadword, entry ID and sense ID. No distinction between genera has been made with regard to homographs and homonyms. For both XML files, the corresponding XML schemas are provided. In rare cases, adjectival headwords are included when the sense pertains to a multi-word unit containing an adjective and a substantive. Similarly, some noun senses are excluded if they pertain to non-nominal phrases or are defined only by synonyms. In the current version, words such as vsak, vsaka, vsako, and del, which form syntactic heads, are treated as genera. All genera are single-word units, even in cases where multi-word units would be expected.
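The relation between the sense-focused and the genus-focused dataset can be illustrated with a minimal sketch. The element and attribute names below (senses, sense, genus, headword, entryId, senseId) and the sample entries are assumptions made for illustration only; the actual names are defined by the XML schemas distributed with the files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample. Element and attribute names are hypothetical,
# not taken from the schemas shipped with the dataset.
SAMPLE = """
<senses>
  <sense headword="hrast" entryId="e1" senseId="s1">
    <genus>drevo</genus>
  </sense>
  <sense headword="bukev" entryId="e2" senseId="s1">
    <genus>drevo</genus>
    <genus>les</genus>
  </sense>
</senses>
"""


def group_by_genus(xml_text):
    """Invert a sense-focused listing into a genus -> list of senses mapping."""
    root = ET.fromstring(xml_text.strip())
    groups = defaultdict(list)
    for sense in root.iter("sense"):
        key = (sense.get("headword"), sense.get("entryId"), sense.get("senseId"))
        # A sense may carry more than one genus, so it can land in several groups.
        for genus in sense.iter("genus"):
            groups[genus.text.strip()].append(key)
    return dict(groups)


groups = group_by_genus(SAMPLE)
```

Because genera are keyed by their plain string form, homographs end up in the same group, which matches the description above.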

2026-06-22T00:00:00Z AI-generated text corpus AI-GenT 1.0 Terčon, Luka Dobrovoljc Zor, Kaja http://hdl.handle.net/11356/2210 2026-06-24T12:47:40Z 2026-06-24T00:00:00Z

AI-generated text corpus AI-GenT 1.0 Terčon, Luka; Dobrovoljc Zor, Kaja The AI-Generated Text (AI-GenT) corpus is a collection of English and Slovenian texts generated by several large language models. The corpus has been used in comparisons to collections of human-written texts in order to investigate the linguistic characteristics of the language generated by LLMs. The current version of the corpus contains texts that were constructed based on two preexisting human-written text corpora: the Šolar 3.0 corpus of Slovenian student essays (http://hdl.handle.net/11356/1589) and the LOCNESS corpus of English native speaker student essays (provided by the Centre for English Corpus Linguistics (CECL) at Université catholique de Louvain in Belgium; https://www.learnercorpusassociation.org/resources/tools/locness-corpus/). Three different LLMs, namely GPT-5 (https://developers.openai.com/api/docs/models/gpt-5), GaMS-27B (https://huggingface.co/cjvt/GaMS-27B-Instruct), and gemma-2-27b (https://huggingface.co/google/gemma-2-27b-it), were instructed to produce texts corresponding to those in the human-written corpora using prompts containing information about the topic and length of the desired output. The AI-generated texts were produced by taking various subsets of the original human-written corpora as the basis for constructing the input prompts. For a full overview of the data, model, and prompt type combinations used to generate the AI-generated texts, please refer to the included AI-GenT_structure.png file, which gives a full visual representation of the corpus structure. The corpus contains the AI-generated texts both as raw text files and in the CoNLL-U file format containing grammatical annotations following the UD system of annotation (https://universaldependencies.org/).
UD annotations were generated using the Trankit NLP pipeline (https://aclanthology.org/2021.eacl-demos.10/), with the default model for English and a custom model for Slovenian retrained on UD v2.15 data (http://hdl.handle.net/11356/1997). There are plans to extend the corpus with additional AI-generated news articles and Wikipedia articles.
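As a minimal sketch of how the CoNLL-U files could be inspected, the snippet below counts universal part-of-speech tags using only the standard library. The two sample sentences are invented and not taken from the corpus; the column layout (ten tab-separated fields, UPOS in the fourth) follows the general CoNLL-U format.

```python
from collections import Counter

# Invented two-sentence sample in CoNLL-U layout (10 tab-separated columns).
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = Cats sleep.\n"
    "1\tCats\tcat\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def upos_counts(conllu_text):
    """Count UPOS tags over all regular tokens in a CoNLL-U string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        counts[cols[3]] += 1
    return counts


counts = upos_counts(SAMPLE)
```

Running the same function over human-written and AI-generated files would give comparable tag distributions, which is one simple way to start the kind of comparison the corpus is meant to support.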

2026-06-24T00:00:00Z Monitor corpus of Slovene Trendi 2026-05 Kosem, Iztok Čibej, Jaka Dobrovoljc, Kaja Erjavec, Tomaž Ljubešić, Nikola Ponikvar, Primož Šinkec, Mihael Krek, Simon http://hdl.handle.net/11356/2219 2026-06-07T09:22:35Z 2026-06-07T00:00:00Z

Monitor corpus of Slovene Trendi 2026-05 Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 62 publishers. Trendi 2026-05 covers the period from January 2019 to May 2026, complementing the Gigafida 2.2 reference corpus of written Slovene (http://hdl.handle.net/11356/2106). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies scheme (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si). This version adds texts from May 2026.

2026-06-07T00:00:00Z Slovene instruction-following safety dataset for large language models GaMS-Instruct-SAFE 0.5 Čibej, Jaka Kos, Sara Kastelic, Maja Gabrovšek, Dejan Trojar, Mitja Ježovnik, Janoš Bizjak Končar, Aleksandra Krvina, Domen Petric Žižić, Špela Divjak Race, Duša Vreš, Domen http://hdl.handle.net/11356/2218 2026-06-01T11:01:50Z 2026-06-01T00:00:00Z

Slovene instruction-following safety dataset for large language models GaMS-Instruct-SAFE 0.5 Čibej, Jaka; Kos, Sara; Kastelic, Maja; Gabrovšek, Dejan; Trojar, Mitja; Ježovnik, Janoš; Bizjak Končar, Aleksandra; Krvina, Domen; Petric Žižić, Špela; Divjak Race, Duša; Vreš, Domen GaMS-Instruct-SAFE is an instruction-following safety dataset designed to fine-tune Slovene large language models to provide safe responses (i.e. to train them to refuse to respond to prompts that could lead to physical, economic or psychological harm). It consists of pairs of prompts and responses covering various safety topics (e.g. sexual harassment, terrorism, violent crime, drugs). The prompts were written by human annotators using Label Studio (Tkachenko et al. 2025) based on a provided set of criteria (such as topic, expected prompt length, language standardness, different jailbreak strategies) to make the dataset as varied as possible (see Čibej 2024 for more details). In version 0.5, the responses to the prompts were generated using GaMS-27B-Instruct-Nemotron (https://huggingface.co/cjvt/GaMS-27B-Instruct-Nemotron). Only prompt-response pairs in which the model refused to cooperate were included. More responses will be added in future versions. The annotations for this dataset were created using Label Studio, open-source data labeling software developed by Heartex (Tkachenko et al. 2025). References: Čibej, Jaka, 2024: First steps toward the compilation of a safety dataset for Slovene large language models. Jezikovne tehnologije in digitalna humanistika. https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=slv&id=164271 Tkachenko, Maxim, Mikhail Malyuk, Andrey Holmanyuk, Nikolai Liubimov, 2025: Label Studio: Data labeling software. https://github.com/HumanSignal/label-studio

2026-06-01T00:00:00Z Slovenian Day of Resistance X & news corpus 1.1 Koražija, Jure Horvat, Marjan Babnik, Jan Škvorc, Tadej Robnik-Šikonja, Marko Darovec, Darko Oman, Žiga http://hdl.handle.net/11356/2216 2026-05-26T13:01:47Z 2026-05-26T00:00:00Z

Slovenian Day of Resistance X & news corpus 1.1 Koražija, Jure; Horvat, Marjan; Babnik, Jan; Škvorc, Tadej; Robnik-Šikonja, Marko; Darovec, Darko; Oman, Žiga The dataset contains social media posts from X and traditional media articles from online news sources related to the Slovenian commemorations of the Day of Resistance. We used two types of data: For the social media analysis, we collected X posts covering the period from April 2023 to April 2024. This dataset was gathered by Sciences Po under the SoMe4Dem project. The collection focused on commemorative discussions in Slovenian and comprised 753 posts. The X dataset was compiled using the query terms “Dan upora proti okupatorju” and “Dan upora”, with special-character normalization to ensure broader retrieval of relevant posts. To analyze traditional media, we collected relevant news articles using Media Cloud (https://www.mediacloud.org/), an open-source platform developed by the Berkman Klein Center for Internet & Society at Harvard University, which compiles and organizes online news content to facilitate research on attention, representation, influence, and language in global media ecosystems. The Slovenian database was queried using the following 14 case-sensitive keywords: »dan upora«, »dnevu upora«, »dan OF«, »dneva OF«, »proti okupatorju«, »državna proslava«, »državne proslave«, »državni proslavi«, »dan spomina«, »dnevu spomina«, »osvobodilna fronta«, »osvobodilne fronte«, »protiimperialistična fronta« and »protiimperialistične fronte«. Additional news material was collected through links found in the X dataset and manually retrieved from three Slovenian weekly publications: Delo, Demokracija, and Mladina. We included all relevant news articles published on this topic for three consecutive years, from 2022 to 2024. 
After collecting traditional media news articles from Media Cloud and X links, 144 irrelevant or duplicated articles were identified, thus reducing the media part of our dataset from 308 to 164 articles. For publication and data-sharing purposes, version 1.1 transforms the original version of the X dataset into an anonymized, feature-based analytical dataset. The published version contains post-level entries with derived features such as Greimasian actantial coding, actant clusters, actor and character fields, author stance, antagonism score, discourse-function indicators (+ action coding), HDBSCAN-based cluster information, and average cluster scores.

2026-05-26T00:00:00Z The Trankit model for linguistic processing of written and spoken Slovenian 1.3 Krsnik, Luka Dobrovoljc, Kaja Terčon, Luka http://hdl.handle.net/11356/2201 2026-05-26T08:26:46Z 2026-05-25T00:00:00Z

The Trankit model for linguistic processing of written and spoken Slovenian 1.3 Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. Version 1.3 was trained on the same data as version 1.2, except that spoken SST data (UD v2.15) was augmented by colloquial (non-standardized) transcriptions of spoken Slovenian alongside the standardized ones. The resulting model achieves state-of-the-art performance on both standardized (e.g. 
"včasih govorimo takole") and colloquial speech transcriptions (e.g. "včas govorimo tkole"), without affecting the performance on written data.

2026-05-25T00:00:00Z Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSeg 1.0 Robida, Nejc Križaj, Janez http://hdl.handle.net/11356/2209 2026-05-21T10:55:46Z 2026-05-22T00:00:00Z

Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSeg 1.0 Robida, Nejc; Križaj, Janez SloPhonSeg 1.0 is a dataset of automatically generated phonetic segmentations and acoustic-phonetic measurements for selected recordings and transcriptions from the spoken corpus Gos 2.1 (http://hdl.handle.net/11356/1863). The resource contains derived data in two complementary formats: Praat TextGrid annotation files with time-aligned segmentations, and TSV tables with acoustic-phonetic measurements. The TextGrid files align the transcriptions with the audio recordings and segment them into utterances, words, syllables, and phones, while the TSV files provide one row per phone-tier interval and include acoustic measurements, phone context, and token-level metadata. The packaged sample contains 106 recordings and transcriptions. It was selected from recordings in which speakers were marked in the corpus metadata as using standard Slovene, supplemented by additional recordings that were manually confirmed as predominantly standard. The intended selected-speaker sample is gender-balanced, with 66 female and 66 male primary speakers; the packaged recording metadata lists all speakers present in the selected recordings, including additional participants and group/audience identifiers. The sample is based on the five Gos 2.1 subcorpora, with the following distribution: (1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), labelled as Gos in the metadata, 22 recordings, 99,328 source-metadata word tokens. (2) Spoken corpus Gos VideoLectures (http://hdl.handle.net/11356/1444), labelled as GosVL in the metadata, 15 recordings, 52,402 source-metadata word tokens. (3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), including: (3a) Artur-J, 49 recordings, 301,830 source-metadata word tokens: interviews and online events, such as conferences, workshops, and educational videos. 
(3b) Artur-P, 17 recordings, 32,607 source-metadata word tokens: transcribed speech from the Slovene National Assembly. (3c) Artur-N, 3 recordings, 5,112 source-metadata word tokens: non-public speech. The resource provides three parallel versions of the segmentation, differing in the level of phonetic detail: an allophonic phone-level segmentation with 61 phone labels, a diphthong segmentation with 82 phone labels, and a simplified phonemic segmentation with 44 phone labels. Each version contains 106 TextGrid files and 106 measurement TSV files; the aggregate segmentation statistics report 2,951.55 speech minutes and 395,282 aligned word tokens per version. The segmentations were produced through forced alignment using the Montreal Forced Aligner (MFA), a Slovene acoustic model, and a pronunciation workflow based on OptiLEX. The TextGrid files contain tiers for speaker identifiers, standardised and conversational transcriptions, word identifiers, word and syllable segments, phone segments, and automatically generated prosodic and discourse-related cues. The TSV files report duration, average pitch, pitch trend, formant frequencies, intensity, sonority, automatically computed Voice Onset Time (VOT), centre of gravity (COG), preceding and following phone labels, aligned token identifier, MULTEXT-East morphosyntactic description (MSD), utterance context, audio identifier, and speaker identifier. The corresponding audio (and, in part, video) files are available under a restricted licence at http://hdl.handle.net/11356/1973.
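A minimal sketch of how the measurement tables could be aggregated is shown below. The column names (phone, duration, speaker) and the sample rows are assumptions made for illustration; the real TSV headers should be checked in the packaged files before adapting this.

```python
import csv
import io
from collections import defaultdict

# Invented sample with hypothetical column names; one row per phone interval.
SAMPLE = (
    "phone\tduration\tspeaker\n"
    "a\t0.080\tspk1\n"
    "a\t0.100\tspk2\n"
    "t\t0.040\tspk1\n"
)


def mean_duration_by_phone(tsv_text):
    """Return the mean duration in seconds for each phone label."""
    totals = defaultdict(lambda: [0.0, 0])  # label -> [summed duration, row count]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        entry = totals[row["phone"]]
        entry[0] += float(row["duration"])
        entry[1] += 1
    return {phone: total / n for phone, (total, n) in totals.items()}


means = mean_duration_by_phone(SAMPLE)
```

For a real file, the in-memory string would be replaced by an opened file handle passed to csv.DictReader; the three segmentation versions could then be compared by running the same aggregation on each.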

2026-05-22T00:00:00Z Disasters corpus in classical Arabic sources DiCCAS Cicola, Ilaria Pannitto, Ludovica Peta, Ines Fontana, Chiara Norozi, Nahid Demichelis, Marco Aiello, Giulia http://hdl.handle.net/11356/2097 2026-04-21T14:16:23Z 2026-03-12T00:00:00Z

Disasters corpus in classical Arabic sources DiCCAS Cicola, Ilaria; Pannitto, Ludovica; Peta, Ines; Fontana, Chiara; Norozi, Nahid; Demichelis, Marco; Aiello, Giulia The Disasters corpus in classical Arabic sources (DiCCAS) is designed to allow historians to compare different accounts and narratives of disasters in a variety of classical Arabic sources. The corpus encompasses a diverse range of materials, including the Qur’an and the ḥadīth collections Saḥīḥ al-Bukhārī and Saḥīḥ Muslim, as well as several significant historical works, such as al-Ṭabarī’s Kitāb Tārīkh al-rusul wa-l-mulūk and Ibn Taghrībirdī’s Kitāb al-Nujūm al-zāhira fī mulūk Miṣr wa-l-Qāhira. The corpus also incorporates adab texts by al-Jāhiẓ, notably his Rasāʾil, and Ibn al-Jawzī’s al-Mudhish. The DiCCAS corpus is encoded following the Text Encoding Initiative (TEI) Guidelines; the encoding captures the structure of the corpus and marks disaster-related words. It is also available in vertical format, which adds linguistic annotations, i.e. tokenisation, lemmatisation and PoS tagging.

2026-03-12T00:00:00Z Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0 Knez, Timotej Žitnik, Slavko http://hdl.handle.net/11356/2116 2026-06-24T12:49:49Z 2026-04-14T00:00:00Z

Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0 Knez, Timotej; Žitnik, Slavko The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants. The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios. Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon.

2026-04-14T00:00:00Z Corpus of written standard Slovene Gigafida 2.2 Krek, Simon Erjavec, Tomaž Repar, Andraž Čibej, Jaka Arhar Holdt, Špela Gantar, Polona Kosem, Iztok Robnik-Šikonja, Marko Ljubešić, Nikola Dobrovoljc, Kaja Laskowski, Cyprian Grčar, Miha Holozan, Peter Šuster, Simon Gorjanc, Vojko Stabej, Marko Logar, Nataša Terčon, Luka Škvorc, Tadej http://hdl.handle.net/11356/2106 2026-04-02T14:51:00Z 2025-12-08T00:00:00Z

Corpus of written standard Slovene Gigafida 2.2 Krek, Simon; Erjavec, Tomaž; Repar, Andraž; Čibej, Jaka; Arhar Holdt, Špela; Gantar, Polona; Kosem, Iztok; Robnik-Šikonja, Marko; Ljubešić, Nikola; Dobrovoljc, Kaja; Laskowski, Cyprian; Grčar, Miha; Holozan, Peter; Šuster, Simon; Gorjanc, Vojko; Stabej, Marko; Logar, Nataša; Terčon, Luka; Škvorc, Tadej Gigafida 2.2 is a reference corpus of written Slovene texts published in the period 1990-2018. It comprises daily news, magazines, a selection of web texts (a certain portion of which covers news texts as well), and different types of publications (fiction, school books, and non-fiction). The texts have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene. The main novelty of version 2.2 is the segmentation of texts from the newspapers Delo and Dnevnik, which represent the largest share of newspaper texts in the corpus. In version 2.1, these texts contained entire daily editions of newspapers with articles on various topics. Using a combination of automatic and manual methods, we segmented the editions into individual articles. In this way, the Gigafida corpus is also better prepared for future upgrades, as newer journalistic texts (e.g., the ones collected for the monitor corpus Trendi) are already being collected in the form of individual articles. A few other improvements have been made, e.g. fixing invalid tags in several files, which had caused the texts to be excluded from the corpus when uploaded to the concordancers. On the other hand, Gigafida 2.2 does not contain semantic role labels or named entity annotations.
References: Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraz Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem and Kaja Dobrovoljc. Gigafida 2.0: The Reference Corpus of Written Standard Slovene. Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.409/ LOGAR BERGINC, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela and KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542

2025-12-08T00:00:00Z