<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
<title>CLARIN.SI data &amp; tools</title>
<link href="http://hdl.handle.net/11356/1024" rel="alternate"/>
<subtitle>CLARIN.SI repository language resources and tools</subtitle>
<id>http://hdl.handle.net/11356/1024</id>
<updated>2026-05-26T21:03:42Z</updated>
<dc:date>2026-05-26T21:03:42Z</dc:date>
<entry>
<title>Slovenian Day of Resistance X &amp; news corpus 1.1</title>
<link href="http://hdl.handle.net/11356/2216" rel="alternate"/>
<author>
<name>Koražija, Jure</name>
</author>
<author>
<name>Horvat, Marjan</name>
</author>
<author>
<name>Babnik, Jan</name>
</author>
<author>
<name>Škvorc, Tadej</name>
</author>
<author>
<name>Robnik-Šikonja, Marko</name>
</author>
<author>
<name>Darovec, Darko</name>
</author>
<author>
<name>Oman, Žiga</name>
</author>
<id>http://hdl.handle.net/11356/2216</id>
<updated>2026-05-26T13:01:47Z</updated>
<published>2026-05-26T00:00:00Z</published>
<summary type="text">Slovenian Day of Resistance X &amp; news corpus 1.1
Koražija, Jure; Horvat, Marjan; Babnik, Jan; Škvorc, Tadej; Robnik-Šikonja, Marko; Darovec, Darko; Oman, Žiga
The dataset contains social media posts from X and traditional media articles from online news sources related to the Slovenian commemorations of the Day of Resistance. &#13;
&#13;
We used two types of data: For the social media analysis, we collected X posts covering the period from April 2023 to April 2024. This dataset was gathered by Sciences Po under the SoMe4Dem project. The collection focused on commemorative discussions in Slovenian and comprised 753 posts. The X dataset was compiled using the query terms “Dan upora proti okupatorju” and “Dan upora”, with special-character normalization to ensure broader retrieval of relevant posts. &#13;
&#13;
To analyze traditional media, we collected relevant news articles using Media Cloud (https://www.mediacloud.org/), an open-source platform developed by the Berkman Klein Center for Internet &amp; Society at Harvard University, which compiles and organizes online news content to facilitate research on attention, representation, influence, and language in global media ecosystems. The Slovenian database was queried using the following 14 case-sensitive keywords: »dan upora«, »dnevu upora«, »dan OF«, »dneva OF«, »proti okupatorju«, »državna proslava«, »državne proslave«, »državni proslavi«, »dan spomina«, »dnevu spomina«, »osvobodilna fronta«, »osvobodilne fronte«, »protiimperialistična fronta« and »protiimperialistične fronte«. Additional news material was collected through links found in the X dataset and manually retrieved from three Slovenian weekly publications: Delo, Demokracija, and Mladina. We included all relevant news articles published on this topic for three consecutive years, from 2022 to 2024. &#13;
&#13;
After collecting traditional media news articles from Media Cloud and X links, 144 irrelevant or duplicated articles were identified, thus reducing the media part of our dataset from 308 to 164 articles.&#13;
&#13;
For publication and data-sharing purposes, version 1.1 transforms the original version of the X dataset into an anonymized, feature-based analytical dataset. The published version contains post-level entries with derived features such as Greimasian actantial coding, actant clusters, actor and character fields, author stance, antagonism score, discourse-function indicators (+ action coding), HDBSCAN-based cluster information, and average cluster scores.
</summary>
<dc:date>2026-05-26T00:00:00Z</dc:date>
</entry>
<entry>
<title>The Trankit model for linguistic processing of written and spoken Slovenian 1.3</title>
<link href="http://hdl.handle.net/11356/2201" rel="alternate"/>
<author>
<name>Krsnik, Luka</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Terčon, Luka</name>
</author>
<id>http://hdl.handle.net/11356/2201</id>
<updated>2026-05-26T08:26:46Z</updated>
<published>2026-05-25T00:00:00Z</published>
<summary type="text">The Trankit model for linguistic processing of written and spoken Slovenian 1.3
Krsnik, Luka; Dobrovoljc, Kaja; Terčon, Luka
This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). &#13;
&#13;
It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). &#13;
&#13;
In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type.&#13;
&#13;
To utilize this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.&#13;
&#13;
Version 1.3 was trained on the same data as version 1.2, except that spoken SST data (UD v2.15) was augmented by colloquial (non-standardized) transcriptions of spoken Slovenian alongside the standardized ones. The resulting model achieves state-of-the-art performance on both standardized (e.g. "včasih govorimo takole") and colloquial speech transcriptions (e.g. "včas govorimo tkole"), without affecting the performance on written data.
</summary>
<dc:date>2026-05-25T00:00:00Z</dc:date>
</entry>
<entry>
<title>Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSeg 1.0</title>
<link href="http://hdl.handle.net/11356/2209" rel="alternate"/>
<author>
<name>Robida, Nejc</name>
</author>
<author>
<name>Križaj, Janez</name>
</author>
<id>http://hdl.handle.net/11356/2209</id>
<updated>2026-05-21T10:55:46Z</updated>
<published>2026-05-22T00:00:00Z</published>
<summary type="text">Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSeg 1.0
Robida, Nejc; Križaj, Janez
SloPhonSeg 1.0 is a dataset of automatically generated phonetic segmentations and acoustic-phonetic measurements for selected recordings and transcriptions from the spoken corpus Gos 2.1 (http://hdl.handle.net/11356/1863).&#13;
&#13;
The resource contains derived data in two complementary formats: Praat TextGrid annotation files with time-aligned segmentations, and TSV tables with acoustic-phonetic measurements. The TextGrid files align the transcriptions with the audio recordings and segment them into utterances, words, syllables, and phones, while the TSV files provide one row per phone-tier interval and include acoustic measurements, phone context, and token-level metadata.&#13;
&#13;
The packaged sample contains 106 recordings and transcriptions. It was selected from recordings in which speakers were marked in the corpus metadata as using standard Slovene, supplemented by additional recordings that were manually confirmed as predominantly standard. The intended selected-speaker sample is gender-balanced, with 66 female and 66 male primary speakers; the packaged recording metadata lists all speakers present in the selected recordings, including additional participants and group/audience identifiers. The sample is based on the five Gos 2.1 subcorpora, with the following distribution:&#13;
(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), labelled as Gos in the metadata, 22 recordings, 99,328 source-metadata word tokens.&#13;
(2) Spoken corpus Gos VideoLectures (http://hdl.handle.net/11356/1444), labelled as GosVL in the metadata, 15 recordings, 52,402 source-metadata word tokens.&#13;
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), including:&#13;
(3a) Artur-J, 49 recordings, 301,830 source-metadata word tokens: interviews and online events, such as conferences, workshops, and educational videos.&#13;
(3b) Artur-P, 17 recordings, 32,607 source-metadata word tokens: transcribed speech from the Slovene National Assembly.&#13;
(3c) Artur-N, 3 recordings, 5,112 source-metadata word tokens: non-public speech.&#13;
&#13;
The resource provides three parallel versions of the segmentation, differing in the level of phonetic detail: an allophonic phone-level segmentation with 61 phone labels, a diphthong segmentation with 82 phone labels, and a simplified phonemic segmentation with 44 phone labels. Each version contains 106 TextGrid files and 106 measurement TSV files; the aggregate segmentation statistics report 2,951.55 speech minutes and 395,282 aligned word tokens per version.&#13;
&#13;
The segmentations were produced through forced alignment using the Montreal Forced Aligner (MFA), a Slovene acoustic model, and a pronunciation workflow based on OptiLEX. The TextGrid files contain tiers for speaker identifiers, standardised and conversational transcriptions, word identifiers, word and syllable segments, phone segments, and automatically generated prosodic and discourse-related cues. The TSV files report duration, average pitch, pitch trend, formant frequencies, intensity, sonority, automatically computed Voice Onset Time (VOT), centre of gravity (COG), preceding and following phone labels, aligned token identifier, MULTEXT-East morphosyntactic description (MSD), utterance context, audio identifier, and speaker identifier.&#13;
&#13;
The corresponding audio (and, in part, video) files are available under a restricted licence at http://hdl.handle.net/11356/1973.
</summary>
<dc:date>2026-05-22T00:00:00Z</dc:date>
</entry>
<entry>
<title>Monitor corpus of Slovene Trendi 2026-04</title>
<link href="http://hdl.handle.net/11356/2154" rel="alternate"/>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Ponikvar, Primož</name>
</author>
<author>
<name>Šinkec, Mihael</name>
</author>
<author>
<name>Krek, Simon</name>
</author>
<id>http://hdl.handle.net/11356/2154</id>
<updated>2026-05-06T15:17:52Z</updated>
<published>2026-05-06T00:00:00Z</published>
<summary type="text">Monitor corpus of Slovene Trendi 2026-04
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 61 publishers. Trendi 2026-04 covers the period from January 2019 to April 2026, complementing the Gigafida 2.2 reference corpus of written Slovene (http://hdl.handle.net/11356/2106).&#13;
&#13;
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies scheme (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).&#13;
&#13;
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).&#13;
&#13;
The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si).&#13;
&#13;
This version adds texts from April 2026.
</summary>
<dc:date>2026-05-06T00:00:00Z</dc:date>
</entry>
<entry>
<title>Disasters corpus in classical Arabic sources DiCCAS</title>
<link href="http://hdl.handle.net/11356/2097" rel="alternate"/>
<author>
<name>Cicola, Ilaria</name>
</author>
<author>
<name>Pannitto, Ludovica</name>
</author>
<author>
<name>Peta, Ines</name>
</author>
<author>
<name>Fontana, Chiara</name>
</author>
<author>
<name>Norozi, Nahid</name>
</author>
<author>
<name>Demichelis, Marco</name>
</author>
<author>
<name>Aiello, Giulia</name>
</author>
<id>http://hdl.handle.net/11356/2097</id>
<updated>2026-04-21T14:16:23Z</updated>
<published>2026-03-12T00:00:00Z</published>
<summary type="text">Disasters corpus in classical Arabic sources DiCCAS
Cicola, Ilaria; Pannitto, Ludovica; Peta, Ines; Fontana, Chiara; Norozi, Nahid; Demichelis, Marco; Aiello, Giulia
The Disasters corpus in classical Arabic sources (DiCCAS) is designed to allow historians to compare different accounts and narratives of disasters in a variety of classical Arabic sources. &#13;
&#13;
The corpus encompasses a diverse range of materials, including the Qur’an and the ḥadīth collections Saḥīḥ al-Bukhārī and Saḥīḥ Muslim, as well as several significant historical works, such as al-Ṭabarī’s Kitāb Tārīkh al-rusul wa-l-mulūk and Ibn Taghrībirdī’s Kitāb al-Nujūm al-zāhira fī mulūk Miṣr wa-l-Qāhira. The corpus also incorporates adab texts by al-Jāhiẓ, notably his Rasāʾil, and Ibn al-Jawzī’s al-Mudhish. &#13;
&#13;
The DiCCAS corpus is encoded according to the Text Encoding Initiative (TEI) Guidelines, which define the structure of the corpus and mark disaster-related words. It is also available in vertical format, which adds linguistic annotations, i.e. tokenisation, lemmatisation and PoS tagging.
</summary>
<dc:date>2026-03-12T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0</title>
<link href="http://hdl.handle.net/11356/2116" rel="alternate"/>
<author>
<name>Knez, Timotej</name>
</author>
<author>
<name>Žitnik, Slavko</name>
</author>
<id>http://hdl.handle.net/11356/2116</id>
<updated>2026-04-14T13:07:06Z</updated>
<published>2026-04-14T00:00:00Z</published>
<summary type="text">Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
Knez, Timotej; Žitnik, Slavko
The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants.&#13;
&#13;
The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios.&#13;
&#13;
Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon.
</summary>
<dc:date>2026-04-14T00:00:00Z</dc:date>
</entry>
<entry>
<title>Corpus of written standard Slovene Gigafida 2.2</title>
<link href="http://hdl.handle.net/11356/2106" rel="alternate"/>
<author>
<name>Krek, Simon</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Repar, Andraž</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Arhar Holdt, Špela</name>
</author>
<author>
<name>Gantar, Polona</name>
</author>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Robnik-Šikonja, Marko</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Laskowski, Cyprian</name>
</author>
<author>
<name>Grčar, Miha</name>
</author>
<author>
<name>Holozan, Peter</name>
</author>
<author>
<name>Šuster, Simon</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Stabej, Marko</name>
</author>
<author>
<name>Logar, Nataša</name>
</author>
<author>
<name>Terčon, Luka</name>
</author>
<author>
<name>Škvorc, Tadej</name>
</author>
<id>http://hdl.handle.net/11356/2106</id>
<updated>2026-04-02T14:51:00Z</updated>
<published>2025-12-08T00:00:00Z</published>
<summary type="text">Corpus of written standard Slovene Gigafida 2.2
Krek, Simon; Erjavec, Tomaž; Repar, Andraž; Čibej, Jaka; Arhar Holdt, Špela; Gantar, Polona; Kosem, Iztok; Robnik-Šikonja, Marko; Ljubešić, Nikola; Dobrovoljc, Kaja; Laskowski, Cyprian; Grčar, Miha; Holozan, Peter; Šuster, Simon; Gorjanc, Vojko; Stabej, Marko; Logar, Nataša; Terčon, Luka; Škvorc, Tadej
Gigafida 2.2 is a reference corpus of written Slovene texts published in the period 1990-2018. It is comprised of daily news, magazines, a selection of web texts (a certain portion of which covers news texts as well), and different types of publications (fiction, school books, and non-fiction). The texts have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene.&#13;
&#13;
The main novelty of version 2.2 is the segmentation of texts from the newspapers Delo and Dnevnik, which represent the largest share of newspaper texts in the corpus. In version 2.1, these texts contained entire daily editions of newspapers with articles on various topics. Using a combination of automatic and manual methods, we segmented the editions into individual articles. In this way, the Gigafida corpus is also better prepared for future upgrades, as newer journalistic texts (e.g., the ones collected for the monitor corpus Trendi) are already being collected in the form of individual articles. A few other improvements have been made, e.g. fixing invalid tags in several files, which had caused the affected texts to be excluded from the corpus when it was uploaded to the concordancers. On the other hand, Gigafida 2.2 does not contain semantic role labels or named entity annotations.&#13;
&#13;
References:&#13;
Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem and Kaja Dobrovoljc. Gigafida 2.0: The Reference Corpus of Written Standard Slovene. Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.409/&#13;
&#13;
LOGAR BERGINC, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela and KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542
</summary>
<dc:date>2025-12-08T00:00:00Z</dc:date>
</entry>
<entry>
<title>Corpus of written standard Slovene Gigafida 2.1</title>
<link href="http://hdl.handle.net/11356/2055" rel="alternate"/>
<author>
<name>Krek, Simon</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Repar, Andraž</name>
</author>
<author>
<name>Čibej, Jaka</name>
</author>
<author>
<name>Arhar Holdt, Špela</name>
</author>
<author>
<name>Gantar, Polona</name>
</author>
<author>
<name>Kosem, Iztok</name>
</author>
<author>
<name>Robnik-Šikonja, Marko</name>
</author>
<author>
<name>Ljubešić, Nikola</name>
</author>
<author>
<name>Dobrovoljc, Kaja</name>
</author>
<author>
<name>Laskowski, Cyprian</name>
</author>
<author>
<name>Grčar, Miha</name>
</author>
<author>
<name>Holozan, Peter</name>
</author>
<author>
<name>Šuster, Simon</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Stabej, Marko</name>
</author>
<author>
<name>Logar, Nataša</name>
</author>
<id>http://hdl.handle.net/11356/2055</id>
<updated>2026-04-02T14:49:57Z</updated>
<published>2023-08-03T00:00:00Z</published>
<summary type="text">Corpus of written standard Slovene Gigafida 2.1
Krek, Simon; Erjavec, Tomaž; Repar, Andraž; Čibej, Jaka; Arhar Holdt, Špela; Gantar, Polona; Kosem, Iztok; Robnik-Šikonja, Marko; Ljubešić, Nikola; Dobrovoljc, Kaja; Laskowski, Cyprian; Grčar, Miha; Holozan, Peter; Šuster, Simon; Gorjanc, Vojko; Stabej, Marko; Logar, Nataša
Gigafida 2.1 is a reference corpus of written Slovene texts published in the period 1990-2018. It is comprised of daily news, magazines, a selection of web texts (a certain portion of which covers news texts as well), and different types of publications (fiction, school books, and non-fiction). The texts have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene.&#13;
&#13;
Version 2.1 contains the same texts as version 2.0, but includes four additional annotation layers: (1) syntactic dependency annotations based on the Universal Dependencies system (https://universaldependencies.org/); (2) syntactic dependency annotations based on the JOS system; (3) semantic role labelling annotations; (4) named entity annotations. Semantic role labels were assigned with "bilateral-srl" (https://github.com/clarinsi/bilateral-srl), named entities with "The CLASSLA-StanfordNLP model for named entity recognition of standard Slovenian 1.0" (http://hdl.handle.net/11356/1321), and syntactic dependency annotations (both UD and JOS) with "Parser-V3", a predecessor of Stanza (https://pypi.org/project/stanza) and CLASSLA (https://pypi.org/project/classla).&#13;
&#13;
References:&#13;
Simon Krek, Špela Arhar Holdt, Tomaž Erjavec, Jaka Čibej, Andraž Repar, Polona Gantar, Nikola Ljubešić, Iztok Kosem and Kaja Dobrovoljc. Gigafida 2.0: The Reference Corpus of Written Standard Slovene. Proceedings of The 12th Language Resources and Evaluation Conference. Marseille, May 2020. https://www.aclweb.org/anthology/2020.lrec-1.409/&#13;
&#13;
LOGAR BERGINC, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela and KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko; Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542
</summary>
<dc:date>2023-08-03T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovenian legal natural language inference dataset SLawNLI</title>
<link href="http://hdl.handle.net/11356/2100" rel="alternate"/>
<author>
<name>Malenšek, Miha</name>
</author>
<author>
<name>Krajnc, Saša</name>
</author>
<author>
<name>Križnar, Primož</name>
</author>
<author>
<name>Završnik, Aleš</name>
</author>
<author>
<name>Bajec, Marko</name>
</author>
<author>
<name>Žitnik, Slavko</name>
</author>
<id>http://hdl.handle.net/11356/2100</id>
<updated>2026-03-20T12:19:46Z</updated>
<published>2026-03-19T00:00:00Z</published>
<summary type="text">Slovenian legal natural language inference dataset SLawNLI
Malenšek, Miha; Krajnc, Saša; Križnar, Primož; Završnik, Aleš; Bajec, Marko; Žitnik, Slavko
SLawNLI is a human-annotated dataset for Natural Language Inference (NLI) in the Slovenian legal domain. It contains 2,214 examples constructed according to the standard NLI schema (premise, hypothesis, label). The dataset was annotated by four master's students of the Faculty of Law. All examples were hand-validated by a researcher from the Institute of Criminology and a practicing lawyer.&#13;
&#13;
The dataset is derived from four Slovenian laws:&#13;
&#13;
- Kazenski zakonik (KZ-1) — Criminal Code (https://pisrs.si/pregledPredpisa?id=ZAKO5050)&#13;
- Stvarnopravni zakonik (SPZ) — Law of Property Code (https://pisrs.si/pregledPredpisa?id=ZAKO3242)&#13;
- Zakon o varstvu osebnih podatkov (ZVOP-2) — Personal Data Protection Act (https://pisrs.si/pregledPredpisa?id=ZAKO7959)&#13;
- Obligacijski zakonik (OZ) — Obligations Code (https://pisrs.si/pregledPredpisa?id=ZAKO1263)&#13;
&#13;
The dataset is provided in JSONL format.
</summary>
<dc:date>2026-03-19T00:00:00Z</dc:date>
</entry>
<entry>
<title>Slovenian translation corpus Spook 1.1</title>
<link href="http://hdl.handle.net/11356/2077" rel="alternate"/>
<author>
<name>Vintar, Špela</name>
</author>
<author>
<name>Gorjanc, Vojko</name>
</author>
<author>
<name>Erjavec, Tomaž</name>
</author>
<author>
<name>Fišer, Darja</name>
</author>
<author>
<name>Mezeg, Adriana</name>
</author>
<id>http://hdl.handle.net/11356/2077</id>
<updated>2026-03-11T11:48:05Z</updated>
<published>2026-03-10T00:00:00Z</published>
<summary type="text">Slovenian translation corpus Spook 1.1
Vintar, Špela; Gorjanc, Vojko; Erjavec, Tomaž; Fišer, Darja; Mezeg, Adriana
The Spook corpus was compiled to enable corpus-based studies in translation and comprises 713 texts and about 375 thousand words. It is composed of three types of texts. The first type comprises foreign-language texts in French, English, German, and Italian. The second type consists of the corresponding texts in Slovenian. These two types of texts are aligned on the sentence level and comparable in terms of genre and time of publication. The third type consists of original Slovenian texts and is comparable to the Slovenian part of the parallel corpora. &#13;
The transcription of the texts and paragraph-level alignment of the originals/translations was performed manually.&#13;
&#13;
The texts were automatically tokenised, sentence segmented, PoS tagged and lemmatised in 2012. Linguistic processing of Slovenian texts was performed by ToTaLe (which used TnT for PoS tagging and CLOG for lemmatisation), while German, English, French and Italian texts were analysed by TreeTagger. The PoS tags in the corpus are given in two variants. The first variant is the tag as output by the tagger: the MULTEXT-East tag for Slovenian (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and the respective TreeTagger tagsets for the other languages. The second variant is a mapping of the original tags to the Spook tagset (https://nl.ijs.si/spook/msd/html-en/).&#13;
&#13;
Version 1.0 was released in the scope of the project in 2021 but was available only to project participants. This version updates the TEI encoding of the corpus and changes the vertical files so that they also include the SPOOK tags as attribute-value pairs. It also removes the parallel fiction part of the corpus (2 x 35 texts) due to copyright considerations. Note, however, that these texts are included in the concordancer-mounted corpus.
</summary>
<dc:date>2026-03-10T00:00:00Z</dc:date>
</entry>
</feed>
