CLARIN.SI data & tools

http://hdl.handle.net/11356/1024
CLARIN.SI repository language resources and tools
Mon, 27 Jul 2026 09:04:43 GMT

Sample from the audiobook "Zemlja se z nami pogreza" (The Earth sinks with us)
http://hdl.handle.net/11356/2271
Sivec, Ivan
This entry contains the first part of the audiobook "Zemlja se z nami pogreza" (The Earth sinks with us) by author Ivan Sivec (COBISS ID: 281238019, ISBN: 978-961-291-556-8). The Slovenian writer Ivan Sivec is the country's most prolific author and has been one of its most widely read writers for more than fifteen years. He graduated in Slavic Studies from the Faculty of Arts at the University of Ljubljana and also earned a master's degree in ethnology. Four feature films and television series have been adapted from his works. He is the recipient of the Order of Merit, one of Slovenia's highest state honours, awarded for his outstanding contributions to the Slovenian nation. He is also listed among the most notable and distinguished Slovenians on the Notable People website. In addition, he has received numerous other awards, including the Souvan Award for Lifetime Achievement in recognition of his enduring contribution to the treasury of Slovenian literature.
Wed, 22 Jul 2026 00:00:00 GMT

Sample from the audiobook "Baton in Roki" (Baton and Roki)
http://hdl.handle.net/11356/2287
Milun, Koraljka
This entry contains the first part of the audiobook "Baton in Roki" (Baton and Roki) by author Koraljka Milun (COBISS ID: 276748547, ISBN: 978-961-7279-17-7). The best things happen when you least expect them. One rainy day, fifth-grader Roki looks out the window and sees someone toss a tiny puppy out of a passing car.
Without a moment's hesitation, he wraps the frightened pup in the softest towel he can find and holds him close. From that day on, Roki and Baton are inseparable, and when summer vacation finally arrives, a whole series of adventures awaits them. But after an unfortunate misunderstanding, people in the neighborhood begin to believe that the black dog, Baton, is dangerous. How can Roki clear his best friend's name? And what might happen when the two of them set off on an adventure all by themselves?
Thu, 23 Jul 2026 00:00:00 GMT

Sample from the audiobook "Izgubljeno obzorje" (Lost horizon)
http://hdl.handle.net/11356/2276
Hilton, James; Košnik, Nejc
This entry contains the first part of the audiobook "Izgubljeno obzorje" (Lost horizon) by author James Hilton (COBISS ID: 278854659, ISBN: 978-961-7198-88-1). "Lost horizon", a thrilling and timeless novel, is a masterpiece of fantasy literature and one of the great classics of the twentieth century. It is a captivating story of revolution, utopia, human emotion, and adventure, set in a hidden mountain sanctuary known only as Shangri-La. James Hilton's bestselling adventure novel follows a soldier who, deep in Tibet, discovers what may be humanity's greatest hope for peace: the valley of Shangri-La. Hugh Conway witnessed humanity at its worst while fighting in the trenches of the First World War. Now, more than a decade later, Conway is a British diplomat stationed in Afghanistan, once again confronted by the realities of war. This time, a civil conflict forces him to flee the country by plane.
Sun, 26 Jul 2026 00:00:00 GMT

Sample from the audiobook "Dragi pretekli jaz, za tole mi boš plačal" (Dear past self, you'll pay for this)
http://hdl.handle.net/11356/2286
Grgičević, Andrijana
This entry contains the first part of the audiobook "Dragi pretekli jaz, za tole mi boš plačal" (Dear past self, you'll pay for this) by author Andrijana Grgičević (COBISS ID: 276749059, ISBN: 978-961-7279-18-4). "Dear past self, you'll pay for this" is a witty and heartfelt story about an elementary school student who sends letters back to his younger self in an attempt to fix his mistakes, improve his grades, untangle romantic mishaps, and avoid embarrassing situations, from falling into puddles to awkward misunderstandings at the grocery store. But time and again, life teaches him that not everything can be changed, and that some of our greatest lessons come from the mistakes and setbacks we'd rather avoid.
Sun, 26 Jul 2026 00:00:00 GMT

Sample from the audiobook "Njam, njam, marmelada" (Yum, yum, jam)
http://hdl.handle.net/11356/2285
Koren, Majda
This entry contains the first part of the audiobook "Njam, njam, marmelada" (Yum, yum, jam) by author Majda Koren (COBISS ID: 280747011, ISBN: 978-961-7279-31-3). "At the Eat & Drink Inn, Auntie Kuha was busy making pancakes for the Petelinček family. Then she went to the pantry to fetch some jam. But an unpleasant surprise was waiting for her there! Someone had been sneaking into the pantry at night and licking the jam straight from the jars! Who could it possibly have been? Will Auntie Kuha get to the bottom of the mystery?"
Sun, 26 Jul 2026 00:00:00 GMT

Sample from the audiobook "Kako sklatiti zvezdo z neba" (How to bring a star down from the sky)
http://hdl.handle.net/11356/2284
Bauer, Jana
This entry contains the first part of the audiobook "Kako sklatiti zvezdo z neba" (How to bring a star down from the sky) by author Jana Bauer (COBISS ID: 274409987, ISBN: 978-961-7279-13-9). "Imagine what it's like to fall in love like an elephant. That's exactly what happened to Jure Muha. Even more astonishing, and downright crazy, is that he caught a star from the sky for Lili Zvezdnik. But if a star stays on Earth for too long, it begins to fade. The problem is, Lili doesn't want to give it back! It looks like another case for Oton Kobilica! This time, he'll be joined by Lučka Zvedav, with whom he drinks hot cocoa on Saturday mornings."
Thu, 23 Jul 2026 00:00:00 GMT

Sample from the audiobook "Ti pa kar greš v bitko za pomlad" (Off you go into the battle for spring)
http://hdl.handle.net/11356/2270
Hudolin, Jurij
This entry contains the first part of the audiobook "Ti pa kar greš v bitko za pomlad" (Off you go into the battle for spring) by author Jurij Hudolin (COBISS ID: 285824771, ISBN: 978-961-291-566-7). "Walking Through Istria. A Slow Journey Along Familiar Paths, with New Depth, and a Backpack Full of Thoughts. What happens when a poet decides to experience Istria (not the tourist-filled coastline, but the real, inward Istria) on foot? When a writer who lives in the hamlet of Grupija sets out on a long walk to Pula, where he spent his childhood? The result is a book that is far more than a travelogue.
It is a walking memoir: a deeply personal, witty, and poetic narrative that transcends the boundaries of genre. In "Off you go into the battle for spring", Hudolin, true to himself, measures not kilometres but inner distances. As he walks through villages, over rolling hills, past vineyards, and lingers over Istrian meals, he reflects on his life, his writing, society, and the natural world. He recalls anecdotes from his school days (he attended primary school in Istria), chats with friends and strangers alike, converses with himself, and, in doing so, with you."
Sun, 26 Jul 2026 00:00:00 GMT

Pragmatics understanding benchmark for Czech, Slovenian and Croatian PragMega CzeSloCro
http://hdl.handle.net/11356/2261
Vintar, Špela; Brglez, Mojca; Potočnjak, Mirna; Žižkova, Hana; Sangawa Hmeljak, Nina
PragMega CzeSloCro is a translation and adaptation of a section of the PragMega dataset (Floyd et al., 2026) into Czech, Slovenian and Croatian. The original dataset was manually crafted by psychologists and aimed at discovering whether "pragmatic inferencing" is a result of a single cognitive skill or, on the contrary, of different dissociable skills depending on the type of phenomena encountered. A Slovenian version of the dataset was created first, for which we selected three tasks: Irony, Metaphor, and Humour. These consist of 50, 30, and 25 examples, respectively, or 105 examples in total. The Slovenian benchmark is described in Brglez & Vintar (2026) and is implemented as SloPragMega on the SloBench (https://slobench.cjvt.si) platform. The dataset was then translated into Croatian and Czech by students of Digital Linguistics within a student project, then thoroughly revised by two professional linguists.
Due to the highly nuanced and culturally specific nature of the dataset, some tasks were completely rewritten or replaced by more natural-sounding examples in the respective language. For each language, the dataset is divided into 3 subfolders (Metaphor, Irony, Humor), which contain the following files:
- stim.csv: The actual localised tasks with possible answers,
- stim_en.csv: The original English tasks with possible answers,
- stimOrder.csv: Template for creating a randomized test with the order of questions and answers reshuffled,
- keys.csv: Solutions for the shuffled tasks.
References:
Brglez, M., & Vintar, S. (2026). From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene. In The Fourth Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL 2026) (pp. 44–54). European Language Resources Association (ELRA). https://doi.org/10.63317/4bpncy453r9k
Floyd, S., Gibson, E., Fedorenko, E., & Poliak, M. (2026, January 14). PragMega. https://doi.org/10.17605/OSF.IO/DPGE6
Tue, 21 Jul 2026 00:00:00 GMT

Monitor corpus of Slovene Trendi 2026-06
http://hdl.handle.net/11356/2259
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 62 publishers. Trendi 2026-06 covers the period from January 2019 to June 2026, complementing the Gigafida 2.2 reference corpus of written Slovene (http://hdl.handle.net/11356/2106). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/).
The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics). The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si). This version adds texts from June 2026.
Mon, 15 Jun 2026 00:00:00 GMT

Speech-level sentiment dataset of Slovenian parliamentary debates ParlaSent-SI 1.0
http://hdl.handle.net/11356/2256
Meden, Katja; Logar, Tamara
The dataset comprises 1,000 manually annotated full utterances (i.e., speeches) from the parliamentary proceedings of Slovenia, extracted from the ParlaMint-SI 4.1 corpus (http://hdl.handle.net/11356/1912).
The manual annotation campaign closely follows the setup used for the ParlaSent 1.0 multilingual sentiment dataset of parliamentary debates (http://hdl.handle.net/11356/1868), which provides sentiment annotations at sentence level. The ParlaSent-SI instances were randomly sampled and each speech was independently annotated by two trained annotators. The annotators underwent extensive training and also participated in the sentence-level sentiment annotation for the ParlaSent 1.0 dataset. The six-level annotation schema, originally based on the framework proposed by Batanović et al. (2020, DOI: https://doi.org/10.1371/journal.pone.0242050), was retained from the sentence-level annotation campaign and only minimally adapted in wording to suit full-utterance annotation:
• Positive for utterances that are predominantly positive
• Negative for utterances that are predominantly negative
• M_Positive for utterances that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the positive sentiment
• M_Negative for utterances that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the negative sentiment
• P_Neutral for utterances that only contain non-sentiment-related statements, but still lean more towards the positive sentiment
• N_Neutral for utterances that only contain non-sentiment-related statements, but still lean more towards the negative sentiment.
The final annotation for each utterance was determined in a separate reconciliation session, where the annotators reviewed their disagreements and agreed on the final tag. The 3-class labels (Positive, Negative, Neutral) are also provided. The dataset includes both procedural (i.e., those spoken by the session chair) and non-procedural parliamentary utterances. Procedural utterances are indicated in the "chair" column.
Inter-annotator agreement (Krippendorff’s α) is reported for the full dataset and the non-procedural subset:
6-class schema: 0.724 (full dataset), 0.570 (non-procedural subset)
3-class schema: 0.852 (full dataset), 0.744 (non-procedural subset)
The datasets are provided in both TSV and JSON formats and contain the initial annotations, annotator comments, procedural/non-procedural flag, flag for hard cases and the reconciled final label for 6- and 3-class sentiment annotation.
Thu, 25 Jun 2026 00:00:00 GMT
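
The ParlaSent-SI entry above lists a six-level schema and a 3-class schema but does not spell out how one maps to the other. The sketch below shows one plausible way to collapse the six tags into three and to skip procedural utterances when reading the TSV file. The mapping itself, the column name `label`, and the `"1"`/`"0"` encoding of the `chair` flag are all assumptions made for illustration; only the tag names and the existence of a "chair" column come from the description, so the header of the distributed file should be checked first.

```python
import csv
import io

# ASSUMED mapping from the six-level schema to the 3-class labels.
# The entry names both schemas but does not state the mapping explicitly.
SIX_TO_THREE = {
    "Positive": "Positive",
    "M_Positive": "Positive",
    "Negative": "Negative",
    "M_Negative": "Negative",
    "P_Neutral": "Neutral",
    "N_Neutral": "Neutral",
}


def collapse_label(six_class_label: str) -> str:
    """Map a six-level tag to its assumed 3-class counterpart."""
    return SIX_TO_THREE[six_class_label]


def load_non_procedural(tsv_text: str, label_col: str = "label",
                        chair_col: str = "chair"):
    """Yield (six_class, three_class) pairs for non-procedural rows.

    The column name "label" and the "1"/"0" encoding of the chair flag
    are hypothetical placeholders, not taken from the dataset.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row[chair_col] == "1":  # assumed: "1" marks a chair (procedural) utterance
            continue
        yield row[label_col], collapse_label(row[label_col])


# Tiny made-up sample, not taken from the dataset.
SAMPLE = "label\tchair\nM_Negative\t0\nP_Neutral\t1\nPositive\t0\n"
pairs = list(load_non_procedural(SAMPLE))
print(pairs)
```

Since the dataset already ships reconciled 3-class labels, a mapping like this is mainly useful as a consistency check against the provided column rather than as a replacement for it.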