CLARIN.SI Projects - CLARIN Slovenia

Since 2018, CLARIN.SI has published an annual tender for developing new resources or services, upgrading existing ones, or other work that furthers the CLARIN(.SI) strategy. CLARIN.SI allocates €30,000 per year for the implementation of projects.

CLARIN.SI project reports 2025

Dataset of synonyms and distractors SYNDIST

Project leaders: Iztok Kosem, Špela Arhar Holdt (FRI UL)
Funding: 2,500 €

In this project, we created triplets consisting of a headword, a synonym, and a distractor, based on synonym data from the Thesaurus of Modern Slovene. In the first step, we selected headword-synonym pairs for the approximately 5,000 headwords with the most synonyms, resulting in 51,023 pairs. We then generated distractors using the large language model Gemini-2.0-flash. We also tested ChatGPT-4o but opted for Gemini because of its better test results. After generating the distractors, we combined the data and exported it to an Excel table, which we used for a multi-layered analysis of the distractors. We automatically assigned them frequency data from the Gigafida 2.0 corpus and calculated the morphological similarity of the distractors to the headwords and the synonyms using the Gestalt method. Although this information was mainly useful for our own purpose of preparing data for the language game, we nevertheless left it in the database that we uploaded to the CLARIN.SI repository. We also automatically marked the (infrequent) cases in which the generated distractor was identical to the synonym. The automatic preparation was followed by a manual analysis of the distractors. Two lexicographers reviewed all distractors and labelled each as GOOD, BAD, or PROBLEMATIC (in dubious cases). After this initial review, one of them reviewed the other’s labels again, so that approximately 30-35% of the distractors were examined by both. During the second review, distractors that were potential new synonyms of the headword were also marked as poor or problematic; such cases were unexpected given the instructions in the prompt.
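The similarity calculation can be illustrated with a minimal sketch. It assumes Python's difflib.SequenceMatcher, whose algorithm is closely related to Ratcliff/Obershelp "gestalt pattern matching"; whether the project used this implementation is not stated, and the example words are hypothetical rather than taken from SYNDIST.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Return a Gestalt-style similarity of two strings in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical triplet (headword, synonym, distractor); not taken from SYNDIST.
headword, synonym, distractor = "hiter", "uren", "hiša"

print(gestalt_similarity(headword, distractor))
print(gestalt_similarity(synonym, distractor))

# A distractor identical to the synonym scores exactly 1.0 and can be flagged.
is_identical = gestalt_similarity(synonym, synonym) == 1.0
print(is_identical)
```

A score close to 1 indicates that the distractor shares long character sequences with the headword or synonym, which may make it a weaker distractor in a language game.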

As the final data shows, 40,866 distractors (more than 80%) were good, 7,595 were poor, 99 were problematic, 2,438 were potential legitimate synonyms, and 25 were identical to synonyms. The review took a considerable amount of time, and we also needed technical support in processing the data and preparing it for analysis. CLARIN covered the costs of one lexicographer and the preparation of the database for the repository, while the rest was covered by the LLM4DH project and the MRIC (CJVT) infrastructure network.
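As a quick consistency check, the reported label counts can be summed and converted to shares. This is only an arithmetic sketch using the figures given above; the dictionary keys are illustrative names.

```python
# Label counts as reported in the text.
counts = {
    "good": 40_866,
    "poor": 7_595,
    "problematic": 99,
    "potential synonym": 2_438,
    "identical to synonym": 25,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The total equals the 51,023 headword-synonym pairs selected in the first step.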

The results of the project were presented at various national and international events, such as the AFRILEX 2025 conference and the eLex conference. Some of the data will also be used in the preparation of a language game that will be part of the CJVT Language Games Portal.

The dataset is available in the CLARIN.SI repository under a CC BY 4.0 license (http://hdl.handle.net/11356/2056).

Ontology of topics for Slovenian as a Second and Foreign Language

Project leader: Eva Pori (Centre for Slovene as a Second and Foreign Language, FF UL)
Project collaborators: Mihaela Knez, Matej Klemen, Tanja Jerman (CSSFL, FF UL)
Funding: 5,250 €

In this project, the first version of the Ontology of Topics for Slovenian as a Second and Foreign Language (ONTEM) was developed. ONTEM can be expanded in the future and will be useful in the field of Slovenian as a second and foreign language (L2 and FL), e.g., for preparing teaching materials, language testing and research, developing lexical resources, and linking dictionary data. It will be possible to integrate it into the Digital Lexical Database for Slovenian and into SLOGOST, the first Dictionary for Speakers of Slovene as a Second and Foreign Language.

In creating the ontology, we started from a set of 1,019 words sourced from the KUUS corpus of textbooks for learning Slovenian as a second and foreign language and from the Core Vocabulary for Slovenian as L2, organized according to CEFR levels A1, A2, and B1. The focus was on words from textbooks at levels A1 and A2; however, since the aim was to create a robust system of semantic categories that could be expanded in the future, the selected list also included vocabulary from higher levels. Experts in the field of Slovenian as L2 and FL independently reviewed the lemmas assigned to level A1 and assessed whether the assigned level was appropriate. For those lemmas where the experts fully agreed, the assigned A1 level was confirmed.

This was followed by a pilot manual annotation of a smaller subset of words (244 words) and then by broader assignment of topics for 1,019 words across up to three hierarchical levels: (I) metatopic, (II) topic, and (III) subtopic—for example, bolan (“sick”) = (I) STATE – (II) BODY AND HEALTH – (III) WELL-BEING. Each word was independently assigned topics by multiple annotators (Slovenian as L2 and FL teachers and experts). The (dis)agreement among the assigned labels was used to define the topics, provide examples for each, and produce the final version of the ontology.

ONTEM contains a total of 64 hierarchically organized topics with detailed descriptions, as well as 1,019 lemmas annotated with part-of-speech tags, CEFR level, information on whether the originally assigned CEFR level was confirmed, and placement within a metatopic, topic, and subtopic.

The dataset is available in the CLARIN.SI repository under the CC BY-NC-SA 4.0 license (http://hdl.handle.net/11356/2069).

Enhancement of the Slovene learner corpus KOST

Project leader: Mojca Stritar Kučuk (Centre for Slovene as a Second and Foreign Language, FF UL)
Collaborators: Slovene language students at Faculty of Arts, University of Ljubljana
Funding: 1,460 €

We acquired 788 new texts for the Slovene learner corpus KOST 2.0, thereby increasing it to 9,134 texts or 1.37 million words. Two-thirds of the new texts were handwritten, so three students of Slovene had to digitise them.

Most of the new texts were acquired as part of the Year Plus program, which has been the main source of texts for KOST to date. We limited ourselves to texts that students write by hand in their end-of-semester exams and that are therefore produced under controlled writing conditions. The remaining texts were obtained from various programs of the Centre for Slovene as a Second and Foreign Language at the Faculty of Arts, University of Ljubljana: Slovene language courses, the Seminar on Slovene language, literature, and culture, the Youth Slovene language summer school, and the Exam Centre.

One of the main criteria for selecting texts for KOST is the first language of their authors. Until now, South Slavic languages have predominated in KOST, and this has not changed with the acquisition of new texts, but the proportion of some first languages has increased significantly, which will enable new research. The new texts were written by authors with the following first languages: Serbian (227 texts), Macedonian (125), Russian (120), Bosnian (71), Croatian (69), Ukrainian (66), English (51), Malagasy (50), German (31), Italian (28), Montenegrin (27), Greek (18), Polish (16), French (15), Slovenian (14, speakers from the border region), Dutch (13), Japanese (11), Spanish (10), Czech (10), Slovak (6), Rusyn (5), Hungarian (3), Chinese (3), Kinyarwanda (3) and Arabic (1).

We have not yet expanded the corpus with manually annotated language errors.

KOST 2.1 is available on the CLARIN.SI repository under the CC BY-SA 4.0 license (http://hdl.handle.net/11356/2066).

Upgrade of the Slovene-Japanese Learner’s Dictionary sloJa

Project leader: Kristina Hmeljak Sangawa (FF, University of Ljubljana)
Developers: Katarina Hitomi Gerl, Miha Kralj, Alja Ivona Pipuš, Ana Razinger, Nina Sangawa Hmeljak, students at the Faculty of Arts, University of Ljubljana
Funding: 2,500 €

The project upgraded sloJa 1.0, the Slovene-Japanese dictionary for Slovene-speaking learners of Japanese compiled in 2023. We added lemmas from the Core Vocabulary of Slovene that were not included in the first version because they were not present in the Japanese-Slovene dictionary data on which it was based, and added new translational equivalents and usage examples to the existing entries, also using data from the Japanese-Slovene parallel corpus JaSlo. Version 1.1 contains 10,031 lemmas (8,464 in v1.0), 17,561 senses (15,583 in v1.0), 20,113 translational equivalents (17,595 in v1.0) and 2,048 examples of use (1,692 in v1.0).

The dictionary can be browsed via Lexonomy (https://www.lexonomy.eu/#/sloja) or downloaded from the CLARIN.SI repository with a CC-BY 4.0 licence (http://hdl.handle.net/11356/2071).

Corpus of conversational humor Krohot

Project leader: Mira Krajnc Ivič (FF UM)
Project collaborators: Larisa Mihailović (Slovene Language and Literature student at FF UM), Dominik Ivič (Philosophy and Japanology student at FF UL), Anemari Pušnik s. p., Slovene Language and Literature students at FF UL, and Darinka Verdonik (FERI UM)
Funding: 5,000 €

The project was divided into several smaller work packages, e.g., defining quality requirements for the content and technical adequacy of recordings, recording footage in the field, reviewing received recordings, selection of recordings, transcription, transcription checking and correction, preparation of standardized transcription, development of a basic annotation scheme for annotation of humorous segments, manual identification and annotation of humorous segments, review of annotated segments, and revision of annotations.

In defining quality requirements for the content and technical adequacy of recordings, we followed the Guidelines for the Collection of Data for Spoken Language Resources (Smernice za zbiranje podatkov za govorne vire, Verdonik 2024), one of the results of the MEZZANINE project. From a technical perspective, this means that the corpus includes recordings stored in WAV format with a sampling rate of 44.1 kHz, 16-bit depth, and a single channel (mono).
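A recording can be checked against these technical requirements with Python's standard wave module. The sketch below is illustrative only: the function name and the demo files are assumptions, not part of the project's tooling.

```python
import os
import tempfile
import wave


def meets_recording_spec(path: str) -> bool:
    """Check a WAV file against the stated spec: 44.1 kHz, 16-bit, mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 44_100
            and wav.getsampwidth() == 2      # 2 bytes per sample = 16 bits
            and wav.getnchannels() == 1      # mono
        )


def write_silence(path: str, rate: int, channels: int, n_frames: int = 100) -> None:
    """Write a short silent 16-bit WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * n_frames)


# Demo with two generated files: one compliant, one stereo at 48 kHz.
tmp_dir = tempfile.mkdtemp()
ok_path = os.path.join(tmp_dir, "ok.wav")
bad_path = os.path.join(tmp_dir, "stereo.wav")
write_silence(ok_path, 44_100, 1)
write_silence(bad_path, 48_000, 2)

print(meets_recording_spec(ok_path), meets_recording_spec(bad_path))
```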

Metadata collection for speakers (gender, age, education, place of residence, language environment in childhood, etc.) and for recording conditions (recording location, domain, recording device, etc.) was carried out through the Govorjena slovenščina portal. The portal also provides solutions for the legal aspects of handling personal data. Furthermore, an application for research approval was submitted to and approved by the Faculty of Arts of the University of Maribor from the perspective of personal data management.

The speech resource predominantly consists of spontaneous narratives about past events that were humorous at the time of their occurrence or are perceived as humorous by the speaker during narration. Where needed, recordists received additional guidelines, e.g. that speakers should engage with the storytelling by contributing humorous remarks, making self-deprecating jokes, teasing their interlocutor, using irony, etc. These additional guidelines contributed to the creation of a high-quality speech resource. All of this required additional time and effort.

The Krohot corpus contains spontaneous speech captured in ten recordings, totalling 232 minutes (nearly four hours) of spoken interaction. The corpus includes 35,271 tokens, of which 5,536 are unique word forms (types). In terms of size, Krohot is comparable to existing humour corpora in other languages.

The material was manually segmented and transcribed in alignment with the GOS conventions (two transcription tiers as described on the Govorjena slovenščina portal), using Transcriber (files pog.trs and std.trs). The conversational transcriptions were imported together with the audio files into the Partitur Editor (EXMARaLDA; file format .exb), where the standard (orthographic) transcriptions were first generated semi-automatically and then manually reviewed and corrected. This was followed by the development of the annotation scheme, which was subsequently imported into the same tool.

The original annotation scheme was developed for the corpus, based on Krajnc & Antloga (2024) and comparative analysis of other annotated humour corpora. The core scheme comprises five annotation categories:

vocabulary (lexical choice, including figurative use),
relation (relationship between speakers),
content (topical focus),
attitude (speaker’s opinion toward the topic), and
manner (purposefully humorous way of speaking).

These categories could be combined; the corpus includes 48 distinct annotation combinations (e.g. ‘content + attitude’ and ‘attitude + content’ are considered separate labels). In total, 647 segments were annotated as instances of conversational humour.
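The order-sensitive treatment of combined labels can be sketched as follows. The five category names come from the scheme above; the segments, the " + " separator, and the function name are hypothetical and serve only to show that reordered combinations count as distinct labels.

```python
from collections import Counter

CATEGORIES = {"vocabulary", "relation", "content", "attitude", "manner"}


def label(categories: list[str]) -> str:
    """Join categories into one label, preserving the annotator's order."""
    unknown = set(categories) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return " + ".join(categories)


# Hypothetical annotated segments, not taken from the corpus.
segments = [
    ["content", "attitude"],
    ["attitude", "content"],
    ["content", "attitude"],
    ["manner"],
]

label_counts = Counter(label(seg) for seg in segments)
print(len(label_counts))  # number of distinct labels
print(label_counts.most_common())
```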

All data in the corpus were manually annotated. In four of the recordings, humour segments were annotated by the original participants. In the remaining six, annotation was carried out by external annotators. This distinction means that some annotators had access to contextual background knowledge, while others approached the material without prior familiarity. Each annotation was verified and/or edited by at least two individuals.

The corpus is prepared for the EXMARaLDA tool EXAKT (.exs; Krohot.coma), which allows searching either by annotations on the “humor” tier (RegEx /Annotation/) or by word forms (RegEx /Transcription/). It is available as WAV audio recordings, with the (aligned) transcriptions provided in the formats of the EXMARaLDA and Transcriber tools, as well as in plain text.

The corpus is designed primarily for linguistic analysis and is fully compatible with the existing GOS corpus of spoken Slovene. It is suitable for future extension.

Reviewing the collected material and the annotated segments required considerable time, and we also needed technical support for data processing and for preparing the data for analysis. In this part of the project as well, we collaborated with the Faculty of Electrical Engineering and Computer Science (LLM4DH project).

The Krohot corpus is available on the CLARIN.SI repository under the CC BY-NC-ND license (http://hdl.handle.net/11356/2065).

Tourism Corpus TURK 3.0

Project leader: Vesna Mikolič (Institute for Linguistic Studies, ZRS Koper)
Other project collaborators: Maša Rolih, Diana Košir (IJŠ ZRS Koper), Jernej Vičič (FAMNIT UP), Miro Romih (Amebis, d. o. o.), Tomaž Erjavec (IJS)
Funding received: 5,135 EUR

The project Tourism Corpus TURK: Upgrade 3.0 was carried out with the aim of upgrading the content of the corpus with newer multilingual tourism texts after 2019, updating the structure and tagging of the corpus according to a uniform taxonomy, and finally transferring the corpus from the repository of the University of Primorska to the national research infrastructure CLARIN.SI, thus facilitating wider use of the corpus among potential target audiences in business and education.

The previous version of the corpus TURK2 (2016–2024) contains 17 thousand documents or 31 million words in Slovenian, Italian, and English. In the scope of the CLARIN.SI project the corpus was supplemented with 127 documents or around 100,000 words, which were obtained from current tourism sources, in particular: the Slovenian Tourist Board (STO), Visit Ljubljana, and Visit Koper. These new texts reflect contemporary trends in Slovenian tourism after the COVID-19 pandemic, with a greater focus on sustainable, cultural, and experiential tourism.

The tagging of the corpus was carried out according to a standardized taxonomy. It consists of 26 categories for thematic areas from the field of tourism (e.g., cultural, culinary, sports, health, urban, festival, mountain, rural); the language of the text (Slovenian, Italian, English, German, unknown); text medium (spoken, electronic, written – published/unpublished, book, periodical, etc.); text genre (artistic/non-artistic; professional, journalistic, juridical, scientific, advertising, etc.); and whether the text was proofread or not.

The annotation work was carried out at the Institute for Linguistic Studies ZRS Koper. After Jernej Vičič transferred the annotated material to the corpus, Miro Romih (Amebis, d. o. o.) linguistically annotated the texts in the CoNLL-U format, while Tomaž Erjavec (IJS) made some corrections to the data and metadata, converted them to vertical format, and prepared the corpus for publication on the CLARIN.SI repository.
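The conversion from CoNLL-U to a vertical format can be sketched as below. This is not the script used in the project: the choice of output columns (word form, lemma, UPOS), the <s> sentence tags, and the sample sentence are assumptions made for illustration.

```python
def conllu_to_vertical(conllu: str) -> str:
    """Convert CoNLL-U text to a simple vertical format.

    Each token becomes one line of form<TAB>lemma<TAB>UPOS, and each
    sentence is wrapped in <s> ... </s>. The column choice is an assumption.
    """
    out = []
    in_sentence = False
    for line in conllu.splitlines():
        if not line.strip():              # a blank line ends a sentence
            if in_sentence:
                out.append("</s>")
                in_sentence = False
            continue
        if line.startswith("#"):
            continue                      # skip comment/metadata lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue                      # skip multiword ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        form, lemma, upos = cols[1], cols[2], cols[3]
        out.append(f"{form}\t{lemma}\t{upos}")
    if in_sentence:
        out.append("</s>")
    return "\n".join(out)


# Hypothetical sample sentence, not taken from TURK.
sample = (
    "# text = Piran je lep.\n"
    "1\tPiran\tPiran\tPROPN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tlep\tlep\tADJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)
vertical = conllu_to_vertical(sample)
print(vertical)
```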

The corpus provides the basis for further development of tourism terminology and for work on the growing TURS tourism dictionary.

TURK 3.0 is available under CC BY on the CLARIN.SI repository (http://hdl.handle.net/11356/2075).

CLARIN.SI project reports 2024

In 2024, CLARIN.SI supported six projects:

Enhancement of the STARK tool for analyzing syntactically parsed corpora

Project leader: Kaja Dobrovoljc (FF UL, IJS)
Developer: Outsmartify, Luka Krsnik s.p.
Funding: 3,400 €

STARK is a versatile tool designed for analyzing and understanding the structure of sentences in large text collections, known as treebanks. It works by identifying and extracting various types of syntactic structures, or ‘trees’, to reveal which structures occur in a language and how significant they are with respect to their frequency and other useful corpus linguistic metrics. In this project, we have significantly upgraded this tool with several new functionalities that ensure its long-term usability for various linguistic applications. Through a thorough revision of the underlying software code, the tool is now able to rapidly extract trees of any length, regardless of the number of words contained and the number of root elements in a sentence. We have also added the ability to ignore selected syntactic phenomena (e.g. punctuation) and to use special characters when formulating specific queries. Final testing in different computing environments confirmed the tool is suitable for analyzing complex structures and larger text corpora, and is already being actively used in several national and international projects.
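The general idea of counting syntactic structures in a treebank can be illustrated with a much simplified sketch. It is not STARK's implementation or interface: it only counts single head-dependent relations in CoNLL-U input, whereas STARK extracts trees of any size. The function name, the ignore_deprels parameter, and the sample sentence are assumptions.

```python
from collections import Counter


def count_relations(conllu: str, ignore_deprels=frozenset({"punct"})) -> Counter:
    """Count (head UPOS, deprel, dependent UPOS) triples in CoNLL-U text."""
    counts = Counter()
    sentence = []

    def flush():
        # Map token IDs to UPOS tags, then count each dependency relation.
        upos_by_id = {tok[0]: tok[3] for tok in sentence}
        for tok in sentence:
            head, deprel = tok[6], tok[7]
            if head == "0" or deprel in ignore_deprels:
                continue                  # skip the root and ignored relations
            counts[(upos_by_id[head], deprel, tok[3])] += 1
        sentence.clear()

    for line in conllu.splitlines():
        if not line.strip():
            flush()                       # a blank line ends a sentence
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():         # regular tokens only
                sentence.append(cols)
    flush()
    return counts


# Hypothetical sample sentence in CoNLL-U.
sample = (
    "# text = Piran je lep.\n"
    "1\tPiran\tPiran\tPROPN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tlep\tlep\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)
counts = count_relations(sample)
print(counts.most_common())
```

Ignoring punctuation here mirrors, in a very reduced form, the option to exclude selected syntactic phenomena from the extracted structures.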

The new version of the program (v3.0), along with updated documentation, is freely available on GitHub (https://github.com/clarinsi/STARK) and on the CLARIN.SI repository (http://hdl.handle.net/11356/1958). The STARK-demo web service (https://orodja.cjvt.si/stark/), which is intended to showcase the tool’s functionality to the general public, has also been updated.

The tool upgrade to version 3.0 was partially co-financed by the SPOT project (ARIS no. Z6-4617).

Training set of explanations for the coreference resolution task

Project participants: Aleš Žagar (UL FRI), Marko Robnik-Šikonja (UL FRI)
Funding: 4,500 €

Winograd Schema Challenge (WSC) is a dataset designed for coreference resolution tasks, focusing on semantically challenging problems and commonsense reasoning. For instance, the sentence: “The trophy doesn’t fit into the brown suitcase because it is too large.” requires understanding that “it” refers to “the trophy” based on semantic reasoning and knowledge about the size relationship between the trophy and the suitcase. We enhanced the original dataset to make it suitable for studying knowledge explanation problems and enabling knowledge-enhanced machine learning by introducing the following improvements:

Annotation of semantically or syntactically solvable examples: Some samples from the original dataset can be solved without deeper semantic processing due to the morphological richness of Slovene. For example, the sentence “Riba je pojedla črva. Bila je lačna.” requires only knowledge of grammatical gender, not deep semantic processing, to infer that the fish was hungry and not the worm. To have a representative set of syntactically solvable samples, we created 197 new examples by modifying the existing ones.
Two-level knowledge ontology: We developed a hierarchical scheme to categorize the knowledge required to solve a problem successfully. In our analysis, we identified 9 high-level knowledge categories (social knowledge, psychological knowledge, etc.) and 37 more nuanced lower-level categories (physical laws/the laws of nature, social roles, causal relationships, etc.).
Semi-Automatic Explanation Generation: Textual explanations were generated using GPT-4, followed by verification and correction by human annotators to ensure accuracy and clarity. For instance, a textual explanation for the sentence “Pokal ne gre v rjav kovček, ker je prevelik.” is “Če je nekaj preveliko, se ne prilega v manjši prostor.”.
Translation to English: The finalized explanations were translated into English by a trained translator, enabling broader applicability (this work was not paid for by the CLARIN.SI project).
SPO Triplet Generation: Subject-Predicate-Object triplets were extracted using GPT-4 to highlight key semantic relationships within each example.

The original dataset contains 804 samples; together with the 197 new examples, the enhanced dataset comprises 1,001 samples: 601 for training, 200 for validation, and 200 for testing. We tried to preserve the original split into training and testing data as much as possible, and all test samples in the original dataset are also present in our test set.

The dataset is publicly available in the CLARIN.SI repository under the title Knowledge-Enhanced Winograd Schema Challenge KE-WSC (http://hdl.handle.net/11356/1988). The test set labels are private, as the dataset is integrated into the SloBENCH evaluation framework (https://slobench.cjvt.si/).

Publication of the conference series “Language Technologies and Digital Humanities”

Applicant: Jezikava, Tina Munda s.p.
Funding: 2,000 €

The project published the papers from the proceedings of the complete conference series Language Technologies and Digital Humanities, which is held biennially in Ljubljana. The goal of the project was to enable the preservation of and access to the conference papers, as well as to increase the visibility of the research covered by the conference topics. Such publication fosters open science, which, in turn, encourages high-quality research and interdisciplinary cross-fertilization, leading to development and innovation.

A total of 504 papers from 14 editions of the conference (1998-2024) have been published on the open-access digital repository Zenodo, financed by the EU and hosted by CERN. Alongside metadata and, where applicable, supplementary material (video presentation, slides), the conference-paper files (PDF) are openly accessible in the Zenodo community Proceedings of the Conference Series “Language Technologies and Digital Humanities” at https://zenodo.org/communities/jt-dh/.

Fun fact: “Zenodo is the name derived from Zenodotus, the first librarian of the ancient library of Alexandria and father of the first recorded use of metadata, a landmark in library history.” (source)

Building the SWOW association dataset for Slovenian

Project leader: Špela Vintar, Faculty of Arts, University of Ljubljana
Collaborators: Prevajanje, programiranje in obdelava podatkov, Mojca Brglez s.p.; Kofein dizajn d. o. o.; Students of the Digital Linguistics Master Programme
Funding: 5,000 €

Free word associations are words or phrases which come to mind spontaneously with a given stimulus or cue word (e.g. woman -> girl, mother, pretty, man etc.). Word associations provide insight into the structure and functioning of the mental lexicon and help us better understand linguistic memory and recall. In the past, they have also been used to study various deviations from the norm, i.e., established responses to cues.

Within this project we built the first word association database for Slovenian, SWOW-SL 1.0 (https://smallworldofwords.org/sl), which contains responses from over 1,100 Slovenian speakers to 1,000 Slovenian cue words, with a total of 19,898 responses. SWOW-SL is methodologically part of the umbrella project Small World of Words (https://www.smallworldofwords.org/en), which currently provides an online association collection environment for 19 languages. Participants in the experiment first provide some basic demographic data, then write up to three associations for each of 18 randomly selected cue words.

In the first stage of the project, we established the online data collection environment for Slovenian, which included translation and adaptation of the website and the selection of 1,000 cue words based on frequency in the Gigafida 2.0 corpus. Then, from May to October 2024, we conducted a crowdsourcing campaign, reaching out to participants through the social networks Facebook and Instagram, while also running a physical campaign with posters and stickers. The goal was to collect at least 16 responses per cue word in total, which we achieved well before the project’s end.

Before publication in the CLARIN.SI repository (http://hdl.handle.net/11356/1980), we performed technical and linguistic processing of the collected responses, adding lemmas and normalized forms with corrected diacritics and capitalization. Based on the frequency of individual responses, the data is also equipped with statistically computed association strengths.
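One common way to derive association strength from response frequencies is the share of a cue's responses that a given response accounts for. The sketch below assumes this definition; the exact statistic used for SWOW-SL is not specified here, and the cue-response pairs are hypothetical.

```python
from collections import Counter, defaultdict


def association_strengths(responses):
    """Map each cue to {response: relative frequency among that cue's responses}."""
    by_cue = defaultdict(Counter)
    for cue, response in responses:
        by_cue[cue][response] += 1
    return {
        cue: {resp: n / sum(counter.values()) for resp, n in counter.items()}
        for cue, counter in by_cue.items()
    }


# Hypothetical (cue, normalized response) pairs, not taken from SWOW-SL.
responses = [
    ("morje", "voda"), ("morje", "voda"), ("morje", "poletje"), ("morje", "sol"),
    ("pes", "mačka"), ("pes", "mačka"),
]
strengths = association_strengths(responses)
print(strengths["morje"])
```

Computing strengths on the normalized forms rather than the raw responses keeps spelling variants of the same response from being counted separately.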

The online platform for collecting associations remains active, and in the future we aim to expand the set of cue words and then resume the collection campaign.

Implementation of support for extended use of Slovenian resources for coreference resolution

Applicants: Matej Klemen (FRI UL), Slavko Žitnik (FRI UL)
Funds awarded: 2,500 EUR

There are two main coreference resolution corpora for Slovenian: coref149 (http://hdl.handle.net/11356/1182) and SentiCoref (http://hdl.handle.net/11356/1285). To make it easier to use them, and promote their wider use, we have done the following within the project:

Developed scripts to convert the datasets from inconsistent data formats into the common CorefUD CoNLL-U format. The international CorefUD initiative aims to unify the data format for coreference resolution data through a modification of the CoNLL-U format.
Developed scripts to enable user-friendly data loading within the widely used dataset library HuggingFace.
Developed scripts for unified coreference resolution evaluation in Slovenian within the evaluation framework SloBENCH (https://slobench.cjvt.si/).

When loading the datasets, the code obtains them from CLARIN.SI repository. The data created using conversion scripts is deposited in the CLARIN.SI repository. The developed scripts have been thoroughly tested and documented.

Project results:

The converted datasets coref149 and SentiCoref in the CorefUD formats are available on the CLARIN.SI repository (http://hdl.handle.net/11356/1989, http://hdl.handle.net/11356/1990).
The coref149 and SentiCoref datasets are available through the HuggingFace library (https://huggingface.co/datasets/cjvt/coref149, https://huggingface.co/datasets/cjvt/senticoref).
The coreference resolution evaluation using the coref149 and SentiCoref datasets has been added to the SloBENCH evaluation framework (https://github.com/clarinsi/slobench-eval-docker/pull/3).
The developed code is archived and documented in a public GitHub repository (https://github.com/clarinsi/CLARINprojekt2024-koreferencnost).

Transcription model for 18th and 19th century Slovenian manuscripts

Applicant: Matija Ogrin (ZRC SAZU)
Collaborators: Marko Kunavar, Barbara Lenarčič
Funding received: 4,000 €

As many texts of older Slovenian literature from the 17th to the 19th century remained in manuscripts and therefore did not enter either the scholarly record or the general cultural reception, the aim of this project was to improve and facilitate the process of transcribing older manuscript texts in Slovenian. This was done with the help of the Transkribus tool.

The project workflow was the following:

In the preliminary stages (before 2024), we prepared digital facsimiles of the manuscripts and manually prepared several dozen pages of diplomatic transcripts of texts by Ignatius Holzapfel (1799-1866) and the Franciscan Tobias Vernik (1801-1886). In 2024, these texts were used as training sets on which Transkribus built two handwriting recognition models, one specifically for Holzapfel and one specifically for Vernik.
With the improved models, we used Transkribus to produce approximately 200 pages of transcriptions for each of the two authors.
The full text (approx. 220 + 300 pages of manuscripts) was then converted to TEI XML and further edited according to the TEI Guidelines.
A common model for transcription of Slovenian manuscripts of this era was then created by combining the features of the four hands into a single model, i.e., we merged four previously constructed training sets: approx. 55,000 words of texts by Franciscan Konrad Branka (1737-1789); 20,000 words by Michael Zagajšek (1739-1827), parish priest at Kalobje; 12,000 words by Franciscan Tobias Vernik; and 93,000 words by Ignatius Holzapfel, spiritual writer and dean in Ribnica. The total training set amounts to about 170,000 words. The size of the training set for each hand varies according to the difficulty and the particularities of the handwriting of that hand. The largest is the training set for Holzapfel, which has by far the most difficult handwriting. The average CER (Character error rate) is 3.29%. The best text recognition is that of the Franciscan Tobias Vernik, because he has the most beautiful handwriting; it is slightly worse for Holzapfel’s very specific handwriting. By adding new training sets of other older writers, the model can be further improved.
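For readers unfamiliar with the metric, the character error rate is usually defined as the character-level edit distance between the recognized text and a reference transcription, divided by the length of the reference. The sketch below implements this standard definition; how Transkribus computes its reported CER in detail is not covered here, and the example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                    # deletion
                current[j - 1] + 1,                 # insertion
                previous[j - 1] + (ch_a != ch_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example strings, not taken from the manuscripts.
reference = "Pridiga na nedelo"
hypothesis = "Pridiga na nedeljo"
print(f"{cer(reference, hypothesis):.2%}")
```

Under this definition, an average CER of 3.29% corresponds to roughly three character errors per hundred characters of reference text.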

The project results are:

The merged transcription model “Slovenian 18th and 19th century manuscripts” publicly available on the Transkribus web service: Model 216113.
Two TEI-encoded diplomatic editions, openly available on the CLARIN.SI repository: http://hdl.handle.net/11356/1993 and http://hdl.handle.net/11356/1995.
The four editions published with the TEI Publisher tool on the “Slovenian Baroque Literature” portal in parallel display of transcription and facsimile: sbs_dipl_ms_206, sbs_dipl_ms_207, sbs_dipl_ms_209, sbs_dipl_ms_210.

CLARIN.SI project reports 2023

In 2023, CLARIN.SI accepted six projects for funding, all of which were successfully completed and are described below:

SemSex: Creating a semantic knowledge base on sexuality and recognizing defined concepts in educational content

Project leader: Slavko Žitnik, FRI UL
Developers: Tim Prezelj, PF & MF UL, Timotej Knez, FRI UL, Miha Štravs, FRI UL
Funding: 7,000 €

With the SemSex project (https://github.com/clarinsi/SemSex) we aimed to address at least some of the systemic shortcomings and lay a basic foundation for further systemic changes and improvements in the field of sex education in Slovenia. Because it involves a culturally sensitive topic, we tried to choose a methodological approach that is as objective, unbiased, and professionally supported as possible to achieve the set goals. This approach should be paradigmatically useful not only within the specific framework of sexuality content but also more broadly. By introducing machine tools, we wanted to establish and test a new, original theoretical-analytical approach to the analysis of the school environment, which we largely succeeded in doing. We hope that the results of the SemSex project will primarily assist decision-makers and researchers in evaluating and optimizing the sex education program in Slovenia. Additionally, we hope that the methodological framework used in this specific case will be extended to the analysis and evaluation of other similar cross-curricular areas, as the current system is distinctly qualitative and, therefore, insufficiently systematic.

Within the framework of the project, three activities were carried out, which are thematically interconnected:

Activity A1: We designed a hierarchically organized framework of content in the field of sex education, based on which a semantic knowledge base for the domain of sexuality was constructed.

Result D1: Semantic knowledge base in machine-readable format (RDF): https://github.com/clarinsi/SemSex#1-ontology

Activity A2: Based on the database (D1), we created a model for recognizing sentences or paragraphs related to a specific concept about sexuality.

Result D2: Code repository with the trained model for recognizing content about sexuality: https://github.com/clarinsi/SemSex#2-concept-detection; http://hdl.handle.net/11356/1894

Activity A3: We conducted a systematic, automatic (with manual review), qualitative analysis of existing curricula, aiming to identify concepts from the semantic knowledge base. Based on this, we can determine which content from the established framework is present, in what manner, and what their formal treatment is.

Result D3: Corpus of all current curricula of Slovenian primary schools with annotated specific sections on sexuality (according to the semantic knowledge base): http://hdl.handle.net/11356/1895

We intend to further develop the methodological approach described within the SemSex project even after the official conclusion of the project, especially since part of the results has already received recognition in a scientific publication (Prezelj, 2023).

ZRCalo: Redesign of the typeface for the ZRCola 2 input system

Project leader: Janoš Ježovnik, ZRC SAZU
Developers: Nace Pušnik (external contractor), Duša Divjak Race, Carmen Kenda-Jež, ZRC SAZU
Funding: 5,000 €

The first phase of the typeface renewal, funded by CLARIN.SI, consisted of the preparation of the ZRCalo typeface with up to 100 characters. The typeface will eventually replace the ZRCola font as a component of the ZRCola 2 input system (http://hdl.handle.net/11356/1090). As part of Phase 1, uppercase and lowercase letters were prepared which are part of the Slovenian alphabet. In addition, certain diacritical marks were also prepared, properly combined by means of component linking. The current version of the font comprises a total of 384 characters, mainly those included in the Unicode blocks Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B and Latin Extended Additional, as well as in certain other blocks. The technical specificities reflected in the characters produced have been edited and adapted accordingly in a dedicated environment. The metric and kerning properties of the letters were adapted already in this phase, especially for the basic character set. As the number of characters increases, the metrics will continue to be edited, as the other newly created characters will also need to be properly combined with the current character set. An open source version (open type format, .otf) has been created from the working file that allows the use of the font across different systems.

The result is published on the CLARIN.SI repository under the CC-BY licence (http://hdl.handle.net/11356/1884).

Website with a comprehensive overview of Slovenian corpus annotation levels

Applicant: Tina Munda, CJVT UL
Funding: 2,000 €

Within the scope of the project, web pages have been developed to provide information on linguistic annotation of Slovenian corpora on the CJVT Wiki, which is available in Slovenian at https://wiki.cjvt.si/shelves/jezikoslovno-oznacevanje-korpusov and in English at https://wiki.cjvt.si/shelves/linguistic-annotation-of-slovene-corpora.

The upgrade covers the overview of the following corpus annotation levels: tokenisation, sentence segmentation, lemmatisation, JOS/MULTEXT-East morphosyntactic descriptions, JOS syntax, Universal Dependencies (UD) syntax, semantic role labelling (SRL), named entities (NER), coreferences, and relations. It also includes specialised systems for annotating language corrections in the Šolar (Slovene student texts) and KOST (texts by speakers of Slovene as a foreign language) corpora. The section on each annotation level contains an introduction, explanation of tags or processes, annotation guidelines, and relevant references and links.

A by-product of this project are TSV files containing Slovenian and English tags for the relevant annotation systems, useful for e.g. building TEI headers in XML files.

In preparation for future enhancements of the website, detailed instructions for content addition have also been added – available in Slovenian only. To increase visibility and user engagement, the project was promoted on social media and in newsletters.

Corpus-based Slovene-Japanese Learner’s Dictionary

Project leader: Kristina Hmeljak Sangawa, FF UL
Developers: Jan Hrastnik, student at FMF UL; Nina Sangawa Hmeljak, student at FRI UL; Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Nina Kališnik, Sara Kleč, Eva Kovač and Jure Tomše, students at FF UL
Funding: 3,500 €

The project compiled a Slovene-Japanese online dictionary for Slovene-speaking learners of Japanese. By extracting and converting data from an existing Japanese-Slovene dictionary, jaSlo 3.1 (http://hdl.handle.net/11356/1050), with 9,891 lemmas, we obtained a preliminary Slovene-Japanese dictionary draft. We then cleaned duplicate or inappropriate entries, first automatically and then manually, labelled the Slovene headwords with part-of-speech tags and with difficulty tags according to the CEFR scale as available in the Core Vocabulary of Slovene (http://hdl.handle.net/11356/1697), and manually edited all entries using Lexonomy.

Senses of polysemous words and corresponding translation equivalents were manually glossed with hints on their meaning, in part also with examples, extracted from the Japanese-Slovene parallel corpus jaSlo (https://nl.ijs.si/jaslo/#parallel) and manually adapted for the learner’s dictionary. Japanese translational equivalents from different registers were tagged according to their level of politeness and with notes on usage restrictions aimed at dictionary users who are learning Japanese as a foreign language.

The dictionary can be browsed via Lexonomy, at https://www.lexonomy.eu/#/sloJa, or downloaded from the CLARIN.SI repository with a CC-BY 4.0 licence, at http://hdl.handle.net/11356/1898.

Creation of a learning dataset of annotated automatically extracted collocation data

Applicant: Iztok Kosem, CJVT UL
Annotators and advisors: Rebeka Roblek, Karolina Zgaga, Bojan Klemenc, Polona Gantar
Funds received: €8,500

As part of the project, a learning dataset of 713,310 collocation candidates was created. The collocation candidates were automatically extracted from the Gigafida 2.0 reference corpus and annotated according to their collocation relevance. Collocation candidates (minimum frequency = 4) were extracted for three syntactic structures, which, in addition to being among the most common syntactic structures in the Slovenian language, are also semantically the most informative:

Verb + noun in the accusative case (163,229 collocation candidates)
Adjective + noun (342,714 collocation candidates)
Noun + noun in genitive case (207,367 collocation candidates).

In the annotation process, three decisions were possible: Yes – a good collocation candidate (syntactic appropriateness and semantic meaning), Extended – a conditionally good collocation candidate, which very often or always has a third element (without which it sometimes does not make sense), No – a bad collocation candidate.
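The candidate counts and the three-way annotation scheme described above can be illustrated with a small sketch. This is not the project's extraction pipeline: the `Candidate` record, the structure keys, and the sample candidates with their frequencies are hypothetical. Only the per-structure totals, the minimum frequency of 4, and the three decision labels are taken from the report.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    YES = "Yes"            # good collocation candidate
    EXTENDED = "Extended"  # conditionally good, usually needs a third element
    NO = "No"              # bad collocation candidate


@dataclass(frozen=True)
class Candidate:
    structure: str    # hypothetical key, e.g. "adjective+noun"
    lemmas: tuple     # the two lemmas of the candidate
    frequency: int    # corpus frequency (invented in the sample below)
    decision: Decision


MIN_FREQUENCY = 4  # threshold stated in the report

# Totals per syntactic structure as reported above.
REPORTED_TOTALS = {
    "verb+noun_acc": 163_229,
    "adjective+noun": 342_714,
    "noun+noun_gen": 207_367,
}


def keep_frequent(candidates):
    """Keep only candidates that reach the minimum frequency."""
    return [c for c in candidates if c.frequency >= MIN_FREQUENCY]


def decisions_per_structure(candidates):
    """Count annotation decisions separately for each syntactic structure."""
    counts = {}
    for c in candidates:
        counts.setdefault(c.structure, Counter())[c.decision] += 1
    return counts


# Hypothetical sample data, for illustration only.
SAMPLE = [
    Candidate("verb+noun_acc", ("piti", "kava"), 120, Decision.YES),
    Candidate("adjective+noun", ("črn", "kava"), 85, Decision.YES),
    Candidate("noun+noun_gen", ("skodelica", "miza"), 3, Decision.NO),
]

if __name__ == "__main__":
    # The three reported subtotals add up to the reported dataset size.
    print(sum(REPORTED_TOTALS.values()))
    print(decisions_per_structure(keep_frequent(SAMPLE)))
```

The third sample candidate falls below the frequency threshold and is dropped before the decisions are counted.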

The learning dataset, which will also be integrated into the Digital Dictionary for Slovenian at the Centre for Language Resources and Technologies of the University of Ljubljana, is available in the CLARIN.SI repository under the CC BY-SA 4.0 license at http://hdl.handle.net/11356/1903.

A Ukrainian parliamentary corpus for research on code switching

Project leader: Anna Kryvenko, INZ (Slovenia) & NISS (Ukraine)
Developers: Matyáš Kopp, Charles University (Czechia), Andriana Rii, student at Lviv Polytechnic National University (Ukraine)
Funding: 8,000 €

The project compiled the Ukrainian parliamentary corpus ParlaMint-UA 4.0.1, which is an extended version of the ParlaMint-UA 4.0 corpus, compiled as part of the “ParlaMint: Towards Comparable Parliamentary Corpora” project funded by CLARIN ERIC and available from http://hdl.handle.net/11356/1859 and http://hdl.handle.net/11356/1860.

Compared to ParlaMint-UA 4.0, ParlaMint-UA 4.0.1 has doubled in size and time span. It now comprises almost 42 million words and includes older data from 2002 to 2012 as well as more recent data from September to November 2023. More details about ParlaMint-UA 4.0.1 can be found at https://ufal.github.io/ParlaMint-UA/.

Importantly, this project initiated the development of code-switching markup in the Ukrainian parliamentary corpus by refining language identification between Ukrainian and Russian from the paragraph level to the sentence level. The lingua-py library was chosen for language identification. It needs to be emphasised that the official language of the Ukrainian parliament has always been Ukrainian. Tokens in Russian make up only 6% of the source texts and are practically not observed after mid-2019, when the Law on Protecting the Functioning of the Ukrainian Language as the State Language came into effect. However, with Ukrainian-Russian bilingualism still widespread in contemporary Ukrainian society, language choices that social actors made in plenary meetings in the recent past were neither irrational nor unnoticed by the voters. Motivations and mechanisms underlying these choices are of great interest to scholars in various fields of the Social Sciences and Humanities as well as to the general public.
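To illustrate why sentence-level identification matters for code-switching research, here is a minimal sketch. It does not use lingua-py and is not the project's method: the detector is a deliberately simple stand-in heuristic that counts letters occurring in only one of the two alphabets, and the sentence splitter, function names, and sample paragraph are all assumptions made for this example.

```python
import re

# Letters that occur in the Ukrainian alphabet but not in the Russian one:
# i (U+0456), yi (U+0457), ye (U+0454), g with upturn (U+0491).
UK_ONLY = set("\u0456\u0457\u0454\u0491")
# Letters that occur in the Russian alphabet but not in the Ukrainian one:
# yeru (U+044B), e (U+044D), hard sign (U+044A), yo (U+0451).
RU_ONLY = set("\u044b\u044d\u044a\u0451")


def guess_language(sentence):
    """Heuristic stand-in for a real detector: returns 'uk', 'ru' or 'und'."""
    lowered = sentence.lower()
    uk = sum(ch in UK_ONLY for ch in lowered)
    ru = sum(ch in RU_ONLY for ch in lowered)
    if uk > ru:
        return "uk"
    if ru > uk:
        return "ru"
    return "und"  # undetermined: no distinguishing letters, or a tie


def split_sentences(paragraph):
    """Naive splitter: break after ., !, ? or an ellipsis followed by space."""
    parts = re.split(r"(?<=[.!?…])\s+", paragraph.strip())
    return [p for p in parts if p]


def label_paragraph(paragraph):
    """Label every sentence of a paragraph separately."""
    return [(s, guess_language(s)) for s in split_sentences(paragraph)]


# Invented example paragraph: a Ukrainian sentence followed by a Russian one.
PARAGRAPH = (
    "Шановні колеги, Україна має своє майбутнє. "
    "Я считаю, что это неправильное решение."
)

if __name__ == "__main__":
    for sentence, lang in label_paragraph(PARAGRAPH):
        print(lang, sentence)
```

A single label for the whole paragraph would hide the switch between the two sentences, which is the motivation for moving identification down to the sentence level. A production system would replace `guess_language` with a trained detector.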

We believe that the Ukrainian parliamentary corpus ParlaMint-UA 4.0.1 will be a handy resource contributing to research on parliamentary discourse at large and advancing corpus-based studies on code switching in institutional contexts.

The result of the project is published in the CLARIN.SI repository under the Creative Commons – Attribution 4.0 International (CC BY 4.0) licence and can be downloaded from http://hdl.handle.net/11356/1900.

CLARIN.SI project reports 2022

In 2022 CLARIN.SI accepted six projects for funding, all of which were successfully concluded and are listed below:

Online service for advanced querying of Slovenian Universal Dependencies treebanks

Project leader: Kaja Dobrovoljc, FF UL
Developer: Miha Štravs, FRI UL student
Funding: 5,000 €

Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service aimed at linguists and other researchers that enables querying syntactically parsed corpora in Slovenian with an easy-to-use query language on the one hand and user-friendly graph visualisations on the other. It is based on the open-source dep_search tool, which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results and filtering by sentence length. The query language is explained on a separate help page containing several illustrative examples and enables users to search for a wide spectrum of grammatical phenomena, from single words to complex syntactic structures. The results are displayed as dependency trees (graphs) and can also be downloaded in various formats. Currently, there are three corpora available for browsing – the manually annotated reference treebanks of written (SSJ) and spoken Slovenian (SST), and the automatically parsed ccKres corpus – however, new corpora in CoNLL-U can also be added.
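As a rough illustration of what a dependency query does, the sketch below searches a CoNLL-U sentence for a head and dependent connected by a given relation. It is written in plain Python and does not use the Drevesnik or dep_search query language; the function names and the three-word sample sentence are assumptions made for this example.

```python
def read_sentences(conllu_text):
    """Yield sentences as lists of token dicts from CoNLL-U text."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


def find_pairs(sentence, head_upos, deprel, dep_upos):
    """Return (head lemma, dependent lemma) pairs matching the pattern."""
    by_id = {tok["id"]: tok for tok in sentence}
    pairs = []
    for tok in sentence:
        head = by_id.get(tok["head"])  # None for the root (head 0)
        if (head is not None and head["upos"] == head_upos
                and tok["deprel"] == deprel and tok["upos"] == dep_upos):
            pairs.append((head["lemma"], tok["lemma"]))
    return pairs


# Invented sample sentence "Ana bere knjigo." (Ana is reading a book.)
ROWS = [
    ("1", "Ana", "Ana", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "bere", "brati", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "knjigo", "knjiga", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE):
        print(find_pairs(sent, "VERB", "obj", "NOUN"))
```

A dedicated query language expresses the same kind of pattern far more compactly and runs it over an indexed corpus; the loop above only shows the underlying idea.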

The Drevesnik source code and documentation are publicly available at https://github.com/clarinsi/drevesnik, while version 1.1 can also be downloaded from the CLARIN.SI repository:

Štravs, Miha and Dobrovoljc, Kaja, 2023, Service for querying dependency treebanks Drevesnik 1.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1923.

The KUUS corpus of textbooks for learning Slovenian as a second and foreign language and basic vocabulary lists for levels A1, A2 and B1

Project leader: Matej Klemen, FF UL
Other collaborators: Špela Arhar Holdt, FF+FRI UL, Damjan Huber, FF UL, Iztok Kosem, FF+FRI UL, Mateja Lutar, FF UL, Senja Pollak, IJS
Funding: 2,500 €

The project developed a corpus of textbooks for learning Slovenian as a second and foreign language, KUUS, and, by analyzing its vocabulary, a core vocabulary list for levels A1, A2 and B1 according to the Common European Framework of Reference for Languages (CEFR). The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published by the Centre for Slovene as a Second and Foreign Language, which are currently widely used in teaching Slovenian as a second and foreign language to children, adolescents and adults in Slovenia and abroad. The KUUS corpus consists of 520,796 words. It is annotated and includes relevant metadata about the textbooks.

Word lists for each level of language proficiency have a long tradition in foreign language learning. For Slovenian as a second and foreign language, they have been included in various ways in language documents, e.g. in the Preživetvena raven za slovenščino (Survival Level for Slovenian; Pirih Svetina et al. 2004), the Sporazumevalni prag za slovenščino (Threshold Level for the Slovenian Language; Ferbežar et al. 2004), etc., and have been prepared as a consensus of the authors of the individual documents. In the project, however, we have prepared a list based on a corpus approach, combining vocabulary for different levels in one document.

We exported words or lemmas from the KUUS corpus and defined robust numerical criteria, which were used to assign each word a CEFR level label: A1-core, A1-broader, A2, B1, etc. We checked whether each word appeared in textbooks as well as in the Reference List of Slovene Frequent Common Words (Pollak et al. 2020, http://hdl.handle.net/11356/1346). Words that were tagged A1, A2, B1 and at the same time were not part of the Reference List of Slovene Frequent Common Words were manually reviewed and content-categorised. A certain proportion of these words were identified as relevant candidates for inclusion in the core vocabulary lists marked with CEFR levels: for instance, we included typical textbook linguistic terminology (e.g. poved, pogojnik, modalen). Thus, the list of core vocabulary in the current version consists of 350 words tagged A1-core, 864 words tagged A1-broader, 1451 words tagged A2, and 2608 words at B1 level; 5273 words in total.

The results of the project are available in the CLARIN.SI repository in two entries under the ACA ID-BY-NC-INF-NORED 1.0 licence:

Corpus of textbooks for learning Slovenian as L2 KUUS 1.0: http://hdl.handle.net/11356/1696
Core vocabulary for Slovenian as L2 1.0: http://hdl.handle.net/11356/1697

The preparation and compilation of the corpus and the production of the core vocabulary list for levels A1, A2 and B1 according to CEFR are presented in more detail in the paper:

KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja, 2022: Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Nataša Pirih Svetina, Ina Ferbežar (eds.): Na stičišču svetov: slovenščina kot drugi in tuji jezik. Obdobja 41. Ljubljana: Založba Univerze v Ljubljani. 165–174. DOI: https://doi.org/10.4312/Obdobja.41.2784-7152.

Parallel Corpus of Idiomatic Texts ParaDiom

Project leader: Gregor Donaj, UM FERI
Other collaborators: Špela Antloga, UM FERI
Funding: 6,000 EUR

ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translation and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus.

Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching through sentences in the corpora ccGigafida, ParlaMint, and The Corpus of Late Modern English Texts. All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features and lemmas using Stanza for English and CLASSLA for Slovene sentences. Some idioms were found as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand, and the other half were translated using machine translation and post-editing. We used the Q-CAT annotation tool to annotate the idiomatic expressions. The annotated noun, adjective and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’).

The results of the project are available in the CLARIN.SI repository under the CC BY-NC-SA 4.0 license:

Parallel corpus of idiomatic text ParaDiom 1.0; http://hdl.handle.net/11356/1714.

Compilation of the SI-NLI Slovene Natural Language Inference Dataset

Project leader: Matej Klemen, UL FRI
Other collaborators: Aleš Žagar, UL FRI, Jaka Čibej, UL CJVT, Marko Robnik-Šikonja, UL FRI
Funding: 10,000 EUR

SI-NLI (Slovene Natural Language Inference Dataset) is a dataset that enables training models to identify natural language inference relations for a set of sentence pairs. For instance, the premise “Med bregoma teče pet metrov široka reka.” (A five-meter-wide river flows between the two banks.) and the hypothesis “Skočil sem z enega na drugi breg.” (I jumped from one riverbank to another.) are annotated as a contradiction (because it is physically impossible for a human to make a jump that long). The dataset was compiled using sentences that occur in Slovene reference corpora. The main goal of the compilation process was to generate varied and diverse examples. An overview of related datasets for English revealed that many of them contain simple examples, which can cause language models to rely on surface-level features instead of logical inference. The final dataset contains 5,937 sentence pairs and is divided into a training set (4,392 examples), a validation set (547 examples), and a testing set (998 examples). The division of examples was done using Slovene BERT-type language models to ensure that both simple and complex examples are uniformly distributed among all three subsets.

The dataset was compiled using a semi-automatic approach consisting of two steps. In the first step, candidate sentence pairs (i.e. a premise and a hypothesis) were extracted from the ccKres 1.0 Reference Corpus of Slovene using a neural sentence encoder. In the second step, the annotators were tasked with modifying the suggested hypothesis (or generating a new hypothesis) for each premise and each language inference relation: entailment (E), neutrality (N), and contradiction (C). The process was described in more detail in guidelines designed to ensure that the generated examples were as diverse as possible. For instance, the guidelines state that simple negation is unsuitable for generating examples for contradiction (C). Every example was annotated by at least two annotators, with some examples additionally annotated by a third annotator. The SI-NLI dataset thus enables a thorough analysis of inference capabilities of Slovene language models, and because particular attention was given to the process of generating diverse examples, the dataset is a methodological improvement not only for Slovene language resources, but in general.

The results of the project are available as follows:

The source code for extracting candidate sentence pairs for further annotation and training natural language inference models is available at https://github.com/clarinsi/si-nli.
The dataset is available at the CLARIN.SI repository: Slovene Natural Language Inference Dataset SI-NLI, http://hdl.handle.net/11356/1707. Because it is also implemented in the SloBENCH evaluation framework (https://slobench.cjvt.si/), the test set labels are private.
We have trained two Slovene natural language inference models on the compiled dataset: a monolingual SloBERTa model (which achieves an accuracy of 73.5%) and a multilingual CroSloEngual model (with an accuracy of 67.3%). The models are publicly available in the HuggingFace model repository at https://huggingface.co/cjvt/sloberta-si-nli, https://huggingface.co/cjvt/crosloengual-bert-si-nli.
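The accuracy figures above are the share of sentence pairs for which a model predicts the annotated relation. A minimal sketch of that calculation follows; the function name and the four toy gold and predicted labels are invented for illustration, while the split sizes and the three relation labels (E, N, C) come from the report.

```python
LABELS = ("E", "N", "C")  # entailment, neutrality, contradiction

# Split sizes as reported for SI-NLI.
SPLITS = {"train": 4392, "validation": 547, "test": 998}


def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty list")
    unknown = set(gold) | set(predicted)
    if not unknown <= set(LABELS):
        raise ValueError("labels must be one of E, N, C")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, invented for illustration (not model output).
TOY_GOLD = ["E", "N", "C", "C"]
TOY_PRED = ["E", "C", "C", "C"]

if __name__ == "__main__":
    print(sum(SPLITS.values()))          # total number of sentence pairs
    print(accuracy(TOY_GOLD, TOY_PRED))  # share of correct toy predictions
```

Because the test set labels are private, accuracy on the test split can only be obtained by submitting predictions to SloBENCH; the same calculation can be run locally on the validation split.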

Compiling a corpus of political party programmes for the 2022 Parliamentary Election

Project leader: Andrej Pančur, INZ
Other collaborators: Petra Polanič, Filip Dobranić, INZ
Funding requested: 2,500 EUR

The corpus includes programmes used by political parties to participate in the 2022 Slovenian Parliamentary Election on April 24, 2022. The programmes were included as they were published on the parties’ websites up until the day before the election.

The text of an individual party’s programme was stored in a separate file, with the exception of the parties Naša prihodnost and Dobra država, which ran together for the election and were thus treated together, i.e. are stored in one file. Each file was first converted into .txt format, which is unchanged with respect to the original except for some specific elements that were excluded from all programmes in which they appeared: introductions of the programme by party leaders, tables of contents, names of candidates and the districts they ran in, and longer quotations (for example from other party documents or the party congress). The text of the programme was examined and cleaned of parts such as text that appeared twice, headers and footers, text that was part of a figure, and descriptions and sources of photos. In the case of two parties that published their programmes as text on their websites, the corpus does not include unfinished sections of the programmes (explicit statements that the section is still being edited or sample text). The text of the programmes was annotated using the CLASSLA tool and converted into the CoNLL-U format.

The 19 programmes consist of a total of 330,559 tokens. The shortest programme in the corpus is the programme by the Lista Borisa Popoviča party (264 tokens) and the longest is the Socialni demokrati party programme (67,071 tokens). The metadata features the party name, its URL and the URL of the programme.

The corpus allows us to examine the programmes and compare their content based on linguistic features (for example by comparing the most frequent adjectives in individual programmes, the presence and frequencies of specific words and phrases and so on). There are considerable differences among parties in their presentation of their programmes; from obvious differences in length to different formats and visual elements (such as the use of graphs, photos, charts). Systematic ways of documenting such elements could be valuable in upgrading this corpus or developing similar corpora in the future. Additionally, the linguistic and content analysis of political party programmes could greatly benefit from larger corpora that include programmes produced over a longer period of time, allowing for comparison within one party through the years, among the most frequently mentioned topics of each election, and so on.
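One of the comparisons mentioned above, the most frequent adjectives in a programme, can be sketched directly on CoNLL-U input. This is only an illustration under stated assumptions: the function names and the short sample are invented, and the sketch relies on the UPOS column (column 4) carrying the value ADJ for adjectives.

```python
from collections import Counter


def adjective_counts(conllu_text):
    """Count lemmas of tokens tagged ADJ in the UPOS column of CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) == 10 and cols[3] == "ADJ":
            counts[cols[2].lower()] += 1
    return counts


def make_row(idx, form, lemma, upos):
    """Build a CoNLL-U line, leaving unused columns as '_' placeholders."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", "_", "_", "_", "_"])


# Invented sample, standing in for one party's programme file.
SAMPLE = "\n".join([
    "# text = zelena prihodnost in zelena energija",
    make_row(1, "zelena", "zelen", "ADJ"),
    make_row(2, "prihodnost", "prihodnost", "NOUN"),
    make_row(3, "in", "in", "CCONJ"),
    make_row(4, "zelena", "zelen", "ADJ"),
    make_row(5, "energija", "energija", "NOUN"),
    "",
    "# text = pravična družba",
    make_row(1, "pravična", "pravičen", "ADJ"),
    make_row(2, "družba", "družba", "NOUN"),
    "",
])

if __name__ == "__main__":
    print(adjective_counts(SAMPLE).most_common(5))
```

Running the same function over each programme file and comparing the resulting counters gives a simple per-party comparison. Since the programmes differ greatly in length, relative frequencies would be more comparable than raw counts.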

The results of the project are available in the CLARIN.SI repository under the CC BY-NC-SA 4.0 license:

Corpus of political party programs Programi2022; http://hdl.handle.net/11356/1734.

Automatic speech recognition test dataset for SloBench platform

Project leader: Darinka Verdonik, UM FERI
Other collaborators: Andreja Bizjak, Simona Majhenič, UM FERI
Funding: 6,000 EUR

In 2021, the evaluation platform SloBench (https://slobench.cjvt.si/) was established. Its goal is to enable independent evaluation of language technology tools for the Slovenian language. Evaluation data are hidden. In the SloBench ASR project we have prepared the test dataset for speech recognition evaluation for the Slovenian language. The data includes recordings and speakers which are, to our knowledge, not present in the available speech databases for the Slovenian language. The data is structured as follows:

15 recordings with a total duration of 3 h 18 min 28 s (3:18:28)
public speech with a total duration of 2:08:35 and private speech with a total duration of 1:09:53
9 recordings with a total duration of 2:03:04 from the south-western part of Slovenia and 6 recordings with a total duration of 1:15:24 from the north-eastern part of Slovenia
18 male speakers and 19 female speakers
public speech covers the topics of evolution, description of a settlement, science slam, description of a life, culture of speech, news, books, and energy; private speech includes 4 monologues and 3 dialogues between two persons
in private speech, 10 speakers are recorded: 3 are up to 30 years old, 5 are between 30 and 49 years old, and 2 are over 50 years old
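The figures in the list above are mutually consistent, which can be checked with a few lines of arithmetic. The helper names below are assumptions made for this sketch; all numbers are taken from the list.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def total(durations):
    """Sum a list of 'h:mm:ss' durations and return the result as a string."""
    return to_hms(sum(to_seconds(d) for d in durations))


TOTAL = "3:18:28"
BY_SPEECH_TYPE = ["2:08:35", "1:09:53"]  # public, private
BY_REGION = ["2:03:04", "1:15:24"]       # south-western, north-eastern

if __name__ == "__main__":
    print(total(BY_SPEECH_TYPE) == TOTAL)  # durations by speech type
    print(total(BY_REGION) == TOTAL)       # durations by region
    print(9 + 6 == 15)                     # recordings by region
    print(4 * 1 + 3 * 2 == 10)             # monologues plus two-person dialogues
    print(3 + 5 + 2 == 10)                 # private-speech speakers by age group
```

Both duration breakdowns sum to the stated total of 3:18:28, and the speaker counts for private speech agree with the number of monologues and dialogues.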

All of the recordings are manually transcribed in two modes, the colloquial transcription and the standardised transcription (Verdonik et al. 2013), following the same standards as those used for the transcription of the Artur speech database for ASR, developed in the ‘Development of Slovene in a Digital Environment’ project and available on the CLARIN.SI repository.

SloBench speech recognition test dataset recordings are available at https://slobench.cjvt.si/leaderboard/view/10. The transcriptions are used to perform the evaluation. Results of the evaluation are published at https://slobench.cjvt.si/leaderboard/view/10.

CLARIN.SI project reports 2021

In 2021 CLARIN.SI accepted four projects for funding; however, only three were successfully completed and are described below.

Extractions from KAS corpus

Applicants: Aleš Žagar, Matic Kavaš and Marko Robnik-Šikonja, University of Ljubljana, Faculty of Computer and Information Science
Funds awarded: 9,500 €

The Corpus of Academic Slovene KAS 1.0 (http://hdl.handle.net/11356/1244) contains BSc, MSc, and PhD theses from the Slovene open science portal in the amount of approximately 82,000 documents, and there also exists a separate repository entry (http://hdl.handle.net/11356/1420) with the abstracts from the KAS corpus. The analysis of the KAS corpus showed that many documents are unsatisfactorily extracted and structured. The inconsistencies we detected are, e.g. mixed Slovenian and English abstracts and keywords, absence of abstracts, other texts instead of abstracts, non-segmented texts, non-existent text classifications, noisy extraction of some text elements, etc. So far, datasets extracted from the corpus have not included summaries nor exploited the coexistence of English and Slovenian abstracts for machine translation.

The project produced a cleaner version of the KAS corpus with added segmentation into chapters, and updated its PoS-tagging. The updated corpus of abstracts contains less noise and its abstracts are labelled by language. We extracted approximately 72,000 Slovenian and 54,000 English abstracts. Using machine learning models, we improved the metadata, supplementing about half of the missing information on the CERIF research areas. From extracted texts and summaries we created several new datasets: a monolingual (72,000 examples) and cross-lingual dataset (54,000 examples) for summarizing long academic texts, and a dataset of aligned sentences from summaries in English and Slovene suitable for training or evaluating machine translation systems. We created three versions of the machine translation dataset with different reliability of alignments: default alignment contains approximately 497 thousand pairs, more reliable alignment 475 thousand, and highly reliable alignment 426 thousand translation pairs.

The program code is available in the repository https://github.com/korpus-kas. With the program code it is possible to extract texts and abstracts, build models for the classification of research areas of individual works, and align sentences of abstracts written in English and Slovenian.

The corpora and datasets are published in the CLARIN.SI repository:

Corpus of Academic Slovene KAS 2.0: http://hdl.handle.net/11356/1448
Abstracts from the KAS corpus KAS-Abs 2.0: http://hdl.handle.net/11356/1449
Summarization datasets from the KAS corpus KAS-Sum 1.0: http://hdl.handle.net/11356/1446
Machine Translation datasets from the KAS corpus KAS-MT 1.0: http://hdl.handle.net/11356/1447

We describe the procedures for extracting and preparing the datasets in the paper:

Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society – IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228

SloBENCH: Design and implementation of an evaluation framework for language technologies

Applicants: Slavko Žitnik, Simon Krek, Marko Robnik-Šikonja and Frenk Dragar, University of Ljubljana, Faculty of Computer and Information Science
Funds awarded: 10,000 €

There are a number of tasks that are important for the development of the natural language processing of a specific language. Examples of such tasks are automatic summarization, translation, part-of-speech tagging and information extraction techniques. Language resources and technologies are available through various platforms (e.g., the CLARIN.SI repository) but their objective comparison is not done end-to-end or uniformly.

The results of this project provide a number of possibilities for overview and transparency over the landscape of developed tools and resources for the Slovenian language. The SloBENCH tool is a Web portal containing publicly available leaderboards for an arbitrary natural language processing task. It allows for multiple user roles for adding, editing and creation of new leaderboard versions. Web services implement automatic evaluations and specific implementation or calculation of benchmarking scores for each leaderboard. Evaluation tools that are part of SloBENCH are published and maintained in the public CLARIN.SI source code repository. For simplicity of testing, they enable running each evaluation tool separately to anyone interested in how evaluation is done within SloBENCH.

The initial version of SloBENCH contains evaluation scripts with examples of training and testing datasets for nine different tasks: named entity recognition, part-of-speech tagging, lemmatization, dependency parsing, semantic role labelling, translation (ENG-SLO, SLO-ENG), summarization and question answering.

After the end of the project, the maintenance will be performed by CJVT. Apart from the internal source code repository of the SloBENCH portal and its documentation within CJVT, the project provides the following public resources:

Portal https://slobench.cjvt.si: Main public access to all available leaderboards
Evaluation framework: https://github.com/clarinsi/slobench-eval-docker
Public DockerHub repository with pre-built Docker images, used by SloBENCH: https://hub.docker.com/r/slobench/eval/tags.

Corpus of metaphorical expressions in spoken Slovene language G-KOMET

Applicant: Špela Antloga, Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor
Funds awarded: 6,000 €

G-KOMET (a corpus of metaphorical expressions in spoken Slovene language) is an upgrade of the hand-annotated written corpus for metaphorical expressions KOMET 1.0 with transcriptions of speech and conversation that covers 50,000 lexical units. The corpus includes a balanced set of transcriptions of informative, educational, entertaining, private, and public discourse. It contains hand-annotated metaphor-related words, i.e. linguistic expressions that have the potential for people to interpret them as metaphors; idioms, i.e. multi-word units in which at least one word has been used metaphorically; and metonymies, i.e. expressions that we use to express something else.

The annotation scheme was based on the MIPVU metaphor identification process. This protocol was modified and adapted to the specifics of the Slovene language and the specifics of the spoken language. The corpus was annotated for the following relations to metaphor: indirect metaphor, direct metaphor, borderline cases and metaphor signals. In addition, the corpus introduces a new ‘frame’ tag, which gives information about the concept to which an expression refers. This conceptual frame allows us to search for figurative expressions within a specific context category (e.g. time, spatial orientation, emotions etc.). Metonymies were furthermore categorized based on the specific metonymic mapping. The G-KOMET corpus allows an objective and systematic analysis of metaphorical expressions, metaphors and metonymies in various Slovene texts.

The corpus is published in the CLARIN.SI repository:

Corpus of metaphorical expressions in spoken Slovene language G-KOMET 1.0: http://hdl.handle.net/11356/1490.

CLARIN.SI project reports 2020

In 2020 CLARIN.SI received fewer project proposals than in previous years, to a large extent due to the intensive involvement of almost all consortium members in the RSDO project. Three projects were selected for funding; however, one of them did not start due to copyright problems with its data. The two successfully concluded projects are described below.

Tutorial on the siParl 2.0 corpus: Voices of the parliament – Corpus linguistics approach to parliamentary discourse

Applicant: Kristina Pahor de Maiti, Faculty of Arts, University of Ljubljana
Funds awarded: € 5,000

CLARIN.SI and Slovene researchers have immensely contributed towards the development of parliamentary corpora and an improved understanding of the potential of parliamentary corpora for the researchers in the European scientific community (via the development of annotation recommendations and parliamentary corpora for several languages, an overview of available parliamentary corpora, organization of content-related international scientific events). However, in the Slovene scientific community, this potential has not yet been fully recognized nor exploited, which is why we used this project to create a tutorial that could help close this gap.

The aim of this project was to create a user-friendly, methodologically sound and research-relevant tutorial that demonstrates the potential of linguistic corpora for the analysis of socio-cultural phenomena through language use in specialized discourse. The tutorial is based on the siParl 2.0 corpus (http://hdl.handle.net/11356/1300) which contains records of Slovene parliamentary debates from 1990–2018, while the analytical tool used was the CLARIN.SI noSketch Engine concordancer (https://www.clarin.si/noske/), i.e. the siParl 2.0 corpus available through this tool.

The tutorial starts with a brief theoretical introduction covering the peculiarities of the specialized political discourse and the effect of gender on communication practices as well as providing an explanation of the most popular corpus analysis techniques. The main part of the tutorial comprises three tasks in which we use various corpus analysis techniques to better understand women’s position in the Slovene parliament. We adopt a step-by-step approach in order to guide the reader from formulating queries and analytical procedures to the interpretation of the results. In addition, we supply screencasts for each task that demonstrate the use of a concordancer which helps the user in carrying out the showcased procedures independently.

While the tutorial uses Slovene corpus data, the analyses demonstrated in the tutorial can also be performed on similar parliamentary corpora in other languages as well as generalized to investigate other types of linguistic corpora. On the one hand, this encourages international comparison of parliamentary culture and discourse, and on the other hand, promotes cross-disciplinary exchange of methodological approaches. In order to reach an international audience, the tutorial was prepared in both Slovene and English.

The tutorial is available in both Slovene and English in the digital library of the Institute of Contemporary History:

FIŠER, Darja, PAHOR DE MAITI, Kristina. Voices of the parliament: “First, I’m a Female Politician, Not a Male One, and Second …”: a corpus approach to parliamentary discourse research. Institute of Contemporary History, 2021. ISBN 978-961-7104-06-6. https://sidih.github.io/voices/index.html.

The compilation of the MEMIS epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia

Applicant: Gregor Pobežin, Institute for Cultural History ZRC SAZU
Funds awarded: € 4,000

In the project “Epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia MEMIS 1.0”, 51 Latin inscriptions from the Medieval and Early Modern period (dating between 1222 and the mid-17th century) were collected, catalogued, processed in XML and translated. In its present extent, the corpus contains only the inscriptions from the Slovenian coastal towns, particularly Koper and Piran, i.e. both the inscriptions still present in their primary context and those that were at some point moved or even destroyed and are therefore only available in transcripts. The material for the corpus was collected during field research by recording the inscriptions in situ.

The inscriptions contained in the corpus are fully expanded (i.e., the abbreviations and ligatures), commented and translated, with most of the relevant epigraphic metadata; for this purpose, the EpiDoc template XML file was used, which facilitates the processing of metadata.

The purpose of the corpus is to create a methodological basis for the processing of medieval and early modern inscriptions in Latin (and in vernacular languages), which are located or were discovered in the area of the Slovenian ethnic territory. The corpus, which will be published as an integrated source within the DARIAH.SI infrastructure, will enable the systematic processing and publication of a rich (written) cultural heritage, which, unlike Roman-era inscriptions, has not been addressed thus far in a scientific manner.

Epigraphic corpus MEMIS 1.0 is available under the CC BY-NC-SA 4.0 licence at:

Pobežin, Gregor, 2020, Epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia MEMIS 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1376.

CLARIN.SI project reports 2019

Following the successful introduction of this initiative in 2018, CLARIN.SI in 2019 again launched a call for project proposals for the members of its consortium. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Six project proposals were accepted with their descriptions and the resources they produced given below.

Tool for statistical analysis of dependency-parsed corpora

Applicants: Kaja Dobrovoljc (FF UL), Marko Robnik Šikonja (FRI UL)
Funds awarded: € 6,000

Within the project, we have developed a computer program for statistical analysis of parsed corpora (the STARK tool) that produces frequency lists of trees from dependency-parsed corpora. The user defines the type of trees to be extracted through several parameters in the configuration file, such as the number of nodes in the tree and their type (from word forms to abstract grammatical categories), and the potential differentiation of trees based on their completeness, labelling and surface word order. In addition to such a bottom-up approach to dependency tree extraction, which does not rely on any linguistic assumptions, the tool also enables tree extraction based on additional restrictions and queries with pre-defined tree structures. The results are output as a tabular text file with information on the tree structure and its nodes as well as on the corpus frequency and the strength of statistical association between nodes according to different association measures. The tool expects the standard CoNLL-U format as input, making it directly applicable not only to Slovenian corpora, such as the ssj500k treebank or the 1-billion-word Gigafida reference corpus, but also to more than 70 other languages with the same type of data already available.
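The bottom-up extraction described above can be illustrated with a much-simplified sketch. It is not STARK's implementation and does not use its configuration file: it only counts two-node trees, represented as (head part of speech, dependency relation, dependent part of speech) triples, in CoNLL-U text. The function name count_relation_trees and the two sample sentences are made up for illustration, and the input is assumed to be fully parsed and well-formed.

```python
from collections import Counter


def count_relation_trees(conllu_text):
    """Count two-node trees as (head UPOS, deprel, dependent UPOS) triples.

    Assumes well-formed, fully parsed CoNLL-U: ten tab-separated columns
    per token and sentences separated by a blank line.
    """
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip sentence-level comments
            cols = line.split("\t")
            if not cols[0].isdigit():
                continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
            rows[int(cols[0])] = cols
        for cols in rows.values():
            head = int(cols[6])
            if head == 0:
                continue  # the root has no head inside the sentence
            counts[(rows[head][3], cols[7], cols[3])] += 1
    return counts


# Two hypothetical sentences in CoNLL-U format.
SAMPLE = (
    "# text = Pes laja.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlaja\tlajati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Mačka spi.\n"
    "1\tMačka\tmačka\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspi\tspati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

# Print a tab-separated frequency list, most frequent trees first.
for (head_pos, relation, dep_pos), freq in count_relation_trees(SAMPLE).most_common():
    print(f"{head_pos}\t{relation}\t{dep_pos}\t{freq}")
```

The real tool supports larger trees, other node types and association measures; the sketch only shows why a standard input format makes the same counting procedure applicable to any language with a CoNLL-U treebank.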

The STARK command-line tool is publicly available under the Apache 2.0 license at https://github.com/clarinsi/STARK, and can also be downloaded from the CLARIN.SI repository:

Krsnik, Luka; Dobrovoljc, Kaja and Robnik-Šikonja, Marko, 2019, Dependency tree extraction tool STARK 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1284.

Establishing access to historical versions of Slovenian language reference corpus Gigafida

Applicant: Andraž Repar, CJVT
Funds awarded: € 1,500

The CLARIN.SI concordancers noSketch Engine and KonText offered only the latest version of the Slovenian reference corpus, Gigafida v2.0. This corpus has been newly linguistically annotated and, contrary to its previous versions, does not include non-standard texts.

For various reasons, it is sometimes necessary to access previous versions of the Gigafida corpus, e.g. to access the removed non-standard texts. This is especially relevant for research on Slovenian in neighbouring countries, because sources containing this variety of Slovene (e.g., the newsletter Novi Matajur) were removed due to their non-standardness. Additionally, access to the older versions enables replicability of previous research performed on this corpus.

The project enabled the previous versions of the Gigafida corpus (FidaPLUS, Gigafida 1.0 and Gigafida 1.1) to be mounted on the CLARIN.SI noSketch Engine and KonText platforms. The plan was also to mount the first version of the corpus, the so-called FIDA, for which agreements were signed with the copyright holders, i.e. the companies Amebis, d.o.o. and DZS, d.d. Unfortunately, the project funds were only sufficient to cover the copyright transfer from DZS to the University of Ljubljana, with none left over to enable the transfer of data from CD-ROMs to the digital form necessary to publish them on the two CLARIN.SI platforms.

CLARIN.SI noSketch Engine and KonText now mount the following versions of the Gigafida corpus:

Gigafida v2.0 proto (not deduplicated): noSketch Engine, KonText,
Gigafida v2.0 (deduplicated): noSketch Engine, KonText,
Gigafida v1.1 (not deduplicated): noSketch Engine, KonText,
Gigafida v1.1 dedup (deduplicated): noSketch Engine, KonText,
Gigafida v1.0: noSketch Engine, KonText,
FidaPLUS: noSketch Engine, KonText.

Corpus for Slovene coreference resolution and aspect-based sentiment analysis–SentiCoref 1.0

Applicant: Slavko Žitnik, FRI UL
Funds awarded: € 6,000

The aim of the project was to compile the SentiCoref 1.0 corpus which includes sentiment annotations for specific entities in the text. In addition to the sentiment level annotation, coreferences and named entities were also tagged. Named entities include person names, organization names and locations. Each named entity is annotated along with all the coreferent mentions that refer to an underlying entity. The corpus enables better coreference resolution analyses and aspect-based sentiment analysis for the Slovene language.

The SentiCoref 1.0 corpus contains texts from the SentiNews 1.0 corpus (Bučar, 2017), which consists of 10,427 documents. Each document in SentiNews 1.0 is annotated with a five-level sentiment at the level of document, paragraph and sentence. SentiCoref 1.0 consists of 837 documents selected from SentiNews 1.0 based on the number of named entities (automatically tagged using the Polyglot tool): the selected documents contain between 50 and 73 named entities.

SentiCoref 1.0 corpus consists of 31,419 named entities: 15,285 organization names, 8,606 person names and 7,528 locations. All the documents form 14,572 coreference chains (i.e., entities) with 438,733 entity mentions. Entities are annotated using the following sentiment levels: very negative: 30 entities; negative: 1,801 entities; neutral: 10,869 entities; positive: 1,705 entities; very positive: 24 entities.

The SentiCoref 1.0 corpus along with the annotation guidelines is available under the CC BY 4.0 licence:

Žitnik, Slavko, 2019, Slovene corpus for aspect-based sentiment analysis – SentiCoref 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1285.

Speech corpus of dialogue acts GORDAN 1.0

Applicant: Darinka Verdonik, Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor
Funds awarded: 6,000 €

During the project Speech corpus of dialogue acts GORDAN 1.0 we developed a dialogue act corpus for Slovene. The corpus contains a balanced sample of different types of spoken discourse with a total length of one hour. The data was selected from previously existing Slovene corpora (GOS, Gos VideoLectures and BERTA) according to four criteria: public/non-public, interactive/monologic, channel and intention.

Before selecting and defining the annotation scheme, four well-established schemes (MRDA, AMI, ISO 24617-2 and DART) were evaluated based on the following criteria: ensuring annotation of pragmatic meaning, coherent structure, general validity and well-balanced structure. Substantial drawbacks regarding these criteria were found in all of the existing schemes. Based on these findings, we have defined the GORDAN 1.0 scheme, which keeps the advantages of the analysed schemes and overcomes their drawbacks.

The selected data has been annotated in accordance with the GORDAN 1.0 annotation scheme in the Transcriber 1.5.1 tool using its function Event. If the video recording was available, the annotators used multimodal data coupling the audio and video recording.

The data are available as two separate datasets:

Zwitter Vitez, Ana; et al., 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (audio/video), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1292: includes original audio recordings (and video recordings if available) that are downloadable under the original licence, i.e., CC BY-NC-ND 4.0.
Verdonik, Darinka, 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1291: includes transcriptions, dialogue act annotations and the GORDAN 1.0 annotation scheme specification that can be distributed under the CC BY 4.0 licence.

Slovene metaphor corpus Komet 1.0

Applicant: Špela Antloga, FERI UM
Funds awarded: € 4,000

The Komet corpus is a hand-annotated corpus of metaphorical expressions which covers 200,000 lexical units from Slovene journalistic, fiction and on-line texts. Metaphors are a complex phenomenon that can be rendered on the linguistic level by novel and creative expressions, or by strongly lexicalized units that are hardly noticeable as metaphorical. Recognizing the complexity of the metaphor phenomenon and the need for clearly defined guidelines for metaphor identification in texts, a group of English linguists developed a procedure for metaphor identification in text: the MIPVU protocol. Since the research on metaphors in Slovene has been very unsystematic and vague, an adapted and modified procedure was used to create a Slovene corpus of metaphors. In this corpus, lexical units (words) whose contextual meaning differs from their basic meaning are considered metaphor-related words. The basic and contextual meaning of each word in the corpus was defined using the Dictionary of the Standard Slovene Language. The corpus was annotated for the following relations to metaphor: indirect metaphor, direct metaphor, borderline cases and metaphor signals. In addition, the corpus is annotated with conceptual frames, which hold information about the concept to which an expression refers. This conceptual frame allows us to search for figurative expressions within a specific context category (e.g., time, spatial orientation, emotions, etc.). The Slovene metaphor corpus Komet enables objective and systematic analysis of metaphorical expressions and metaphors in various Slovene texts.

The Komet corpus is available under the CC BY-NC-SA 4.0 licence:

Antloga, Špela, 2020, Metaphor corpus KOMET 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1293.

Placing new orthographical rules on the Fran portal

Applicant: ZRC SAZU, the Scientific Research Center of the Slovenian Academy of Sciences and Arts
Funds received: € 6,000

The project launched a public presentation of new orthographical rules and corresponding dictionary entries to supplement the rules on the Fran portal. The drafts of the first two chapters of the orthographical rules allow users to participate in a public debate on the adequacy of the proposed solutions and their content.

For this purpose, each chapter of the new spelling rules was converted from Word .docx text format to TEI XML, with a converter developed for this purpose. In this way, orthographical rules are available in accordance with international recommendations, thus facilitating their further development, maintenance, and distribution, as well as their connectivity and adaptability to different uses. The TEI-encoded rules are linked to the Slovenian Normative Guide Dictionary (e-Pravopis).

Namely, simultaneously with the revision of individual chapters of the orthographical rules at the Fran Ramovš Institute of the Slovenian Language, ZRC SAZU is creating a Slovenian Normative Guide Dictionary – ePravopis. Linking ePravopis to the rules is an important step because it combines the information from the dictionary and the rules. Users are thus offered insights that such guides have not allowed so far.

The first two chapters are available on the Fran portal, while the TEI files will be available on the CLARIN.SI repository under CC BY-NC 4.0 license once all the chapters have been prepared.

CLARIN.SI project reports 2018

In 2018, CLARIN.SI launched for the first time a call for project proposals for the members of its consortium. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Seven project proposals were accepted with their descriptions and the resources they produced given below.

Upgrade of the eZISS digital library of text-critical editions of Slovene literature

Applicants: Andrej Pančur, INZ, Matija Ogrin, ZRC SAZU
Funds awarded: € 4,000

The project has upgraded two very complex and extensive editions which include diverse components and realise a variety of text-critical concepts of analysing and displaying texts. In addition, it developed a significantly improved display of the electronic edition, its internal structure (transcriptions, digital facsimiles, notes, critical apparatus, the accompanying scientific commentary) as well as the links between the components. The existing XSLT transformations from the GitHub repository (https://github.com/SIstory/Stylesheets) have been adapted and, to ensure the dynamic display of the parallel sections, upgraded with an XSLT 3.0 transformation for SAXON-JS. The XSLT transformations are accessible in the “Profiles” folder at https://github.com/DARIAH-SI/Foglar-pub and https://github.com/DARIAH-SI/Kapelski-pub.

The project has also entailed editorial work on both editions:

Kapelski pasijon (The Železna Kapla Passion Play): the tagging in the TEI markup language has been improved, the scientific commentary has been partly reorganised and all the transcriptions have been linked with the associated digital facsimile files and mutual references.
Foglarjev rokopis (The Foglar Manuscript): a complete digital edition has been created with the diplomatic and critical transcription of the manuscript, including the apparatus of variants for the several handwritten versions of the poems under consideration. The edition has been prepared by Nina Ditmajer. Both transcriptions have been linked with digital facsimiles and their tagging in the TEI markup language has been adapted to the various possibilities of displaying and linking the texts.

An important motive and aspect of the upgrade is its usefulness for the future digital editions of the eZISS library in the context of the DARIAH-SI research infrastructure. Namely, DARIAH-SI aims to establish a TEI-based digital library enabling the presentation of complex digital editions such as the Železna Kapla Passion Play or the Foglar Manuscript, and a connection to the corpus analysis services at CLARIN.SI.

The Železna Kapla Passion Play is accessible at:

The Foglar manuscript is accessible at:

https://dariah-si.github.io/Foglar-pub/

The corpus of parliamentary minutes of the National Assembly of the Republic of Slovenia 1990-2018

Applicant: Andrej Pančur, INZ
Funds awarded: € 3,000

During the project, the siParl corpus has been created. It contains all the parliamentary minutes of the National Assembly of the Republic of Slovenia between 1990 and 2018 (until the end of the 7th legislative period) as well as all the minutes of the National Assembly’s working bodies since 1996, all of which amounts to almost 230 million tokens in total. The parliamentary minutes from the 1990–1992 period have been obtained from the existing SlovParl 2.0 corpus, while the rest of the minutes have been newly tagged. The tagging has been completed in the TEI module for drama texts and converted into the TEI module for speech transcription. The corpus includes data about the speeches and the speakers, non-verbal content of the session minutes and relevant metadata. The content of the speeches has also been linguistically tagged, i.e. tokenised, morphosyntactically tagged and lemmatised.

The siParl corpus is available through the concordance software and for download under the CC BY licence:

Pančur, Andrej; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca and Blaj Hribar, Neja, 2019, Slovenian parliamentary corpus siParl 1.0 (1990-2018), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1236.

Assigning stress to the Sloleks lexicon

Applicant: Špela Arhar Holdt, CJVT UL
Funds received: € 5,000

In the project, the latest version of the Sloleks morphological lexicon of Slovene was improved with the addition of automatically assigned accents, a portion of which were also manually evaluated. The interface of the lexicon was also upgraded to facilitate crowdsourcing of the newly added data. The project focused on the lemmas in which the position of the accent is fixed on the word stem. In the first step, accents were automatically assigned to all word forms in the lexicon. Through existing dictionary resources, 55% of the automatically assigned accents were confirmed with an estimated accuracy of 75%. 24% of the lexicon data was processed manually, the majority with the use of crowdsourcing. Counting both the results of automatic as well as manual approaches, the project corrected 21.7% of the automatically assigned accents. Future work will include proper nouns as well as lemmas with non-fixed accents and accent variants.

The project also upgraded the design of the user interface: (a) by implementing the graphic design developed for CJVT resources; and (b) by upgrading the interface with features that allow the community to participate in database clean-up (i.e. allowing them to upvote/downvote the assigned accents, the automatically generated phonetic transcriptions and the text-to-speech pronunciation). Additional functionalities are also being developed as part of other ongoing projects, such as the possibility for users to contribute recordings of their own pronunciation of words.

The database is available under the CC BY-NC-SA licence:

Dobrovoljc, Kaja; et al., 2019, Morphological lexicon Sloleks 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1230.

The compilation of word and n-gram lists for various levels of education and for different subjects

Applicant: Iztok Kosem, Faculty of Arts, University of Ljubljana
Funds awarded: € 4,000

The project involved compiling a corpus of textbooks used in Slovenian elementary and secondary schools, and extracting word and n-gram lists and keywords. The collected textbooks were available in PDF and HTML formats and were converted into text format. Afterwards, the converted texts were examined, issues were corrected and the texts were POS-tagged. The corpus contains around 5 million tokens from 127 textbooks covering 16 different subjects. The second step involved the extraction of word lists, etc., and several measures to ensure the quality of data. In addition, the lists were manually analysed. The final results are the following lists:

List of general words occurring in at least 8 out of 16 subjects. The list contains information on lemma, word form and frequency (by education level and the number of subjects).
List of general words by education level (grade/year) containing information on lemma, word class, frequency and number of levels in which the lemma was found.
List of 2-5-grams containing the word-forms of the n-gram, its lemmas, word classes and POS-tags, its frequency and the number of subjects in which the n-gram was found.
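The criterion behind the first list above (words occurring in at least 8 out of 16 subjects) can be sketched as a simple dispersion filter. This is only an illustration under stated assumptions, not the project's actual pipeline: the function name general_words, the (subject, lemma) input shape and the toy data are hypothetical.

```python
from collections import defaultdict


def general_words(tokens, min_subjects=8):
    """Return (lemma, frequency, subject count) rows for widely used lemmas.

    tokens: iterable of (subject, lemma) pairs, one pair per corpus token.
    A lemma is kept only if it occurs in at least min_subjects subjects.
    """
    freq = defaultdict(int)
    subjects = defaultdict(set)
    for subject, lemma in tokens:
        freq[lemma] += 1
        subjects[lemma].add(subject)
    rows = [
        (lemma, freq[lemma], len(subjects[lemma]))
        for lemma in freq
        if len(subjects[lemma]) >= min_subjects
    ]
    # Most frequent first; ties are broken alphabetically.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy data with three subjects; the threshold is lowered accordingly.
toy_tokens = [
    ("biology", "celica"),
    ("biology", "in"),
    ("history", "in"),
    ("physics", "in"),
    ("history", "vojna"),
]
print(general_words(toy_tokens, min_subjects=3))
```

Filtering on the number of subjects rather than on raw frequency keeps subject-specific terminology, which can be very frequent in a single textbook, out of the general list.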

The lists are available under the CC BY 4.0 licence:

Kosem, Iztok; Pori, Eva and Arhar Holdt, Špela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1215.

A Tool for efficient analysis of Slovene corpora

Applicants: Marko Robnik Šikonja, Špela Arhar Holdt, UL FRI
Funds awarded: € 4,000

In the project, a clear and comprehensible user interface for the corpusStatistics tool (renamed to LIST) was developed. The tool offers user-friendly access to language statistics in corpora of Slovene and other languages. The tool was adapted for several corpus formats and tested on large corpora of Slovene and other languages.

The program now includes metadata in all outputs, which enables the reproducibility of the results. The elements of the user interface contain short explanations shown on mouse-over. Several new association measures of word sequences are supported, e.g. Dice, t-score, MI, and MI3. The program now estimates the time needed to return the results and warns users about settings which may require a longer processing time. Users can now also switch between the Slovene and English interfaces and can process non-Latin scripts. The program was upgraded with support for the TEI P5 format used for recently published corpora in the CLARIN.SI repository, and the vertical format (VERT) used by SketchEngine.
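For readers unfamiliar with the association measures named above, the following sketch computes them for a two-word sequence from raw corpus counts. It uses the definitions commonly found in corpus-linguistics tools; whether LIST uses exactly these variants is an assumption, and the function name and the example counts are hypothetical.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Association scores for a word pair (x, y).

    f_xy: frequency of the pair, f_x and f_y: frequencies of the two words,
    n: corpus size in tokens. All counts must be positive.
    """
    expected = f_x * f_y / n  # pair frequency expected under independence
    return {
        "dice": 2 * f_xy / (f_x + f_y),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
    }


# Hypothetical counts: the pair occurs 8 times in a 4,096-token corpus.
scores = association_measures(f_xy=8, f_x=16, f_y=32, n=4096)
for name, value in scores.items():
    print(f"{name}\t{value:.3f}")
```

MI tends to favour rare pairs, while MI3 and the t-score give more weight to the observed frequency, which is one reason tools usually offer several measures side by side.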

The LIST program is available under the Apache2 open licence:

Krsnik, Luka; et al., 2019, Corpus extraction tool LIST 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1227.

Gos Videolectures II

Applicants: Darinka Verdonik, Andrej Žgank, UM FERI
Funds awarded: € 6,000

The goal of the Gos VideoLectures II project was to enlarge the existing Gos VideoLectures database with an additional 8 hours of manual transcriptions of selected speeches from the Videolectures.net database. Transcriptions were done in a two-level transcription system, where the first level represents conversational transcription, and the second level represents standardized transcription as defined in the Slovene GOS corpus. The speech signal was manually segmented into utterances/segments and notable acoustic events were manually annotated. The second goal was to automatically segment the speech signal into words and into a restricted list of phonemes. This was done with an adapted version of the automatic speech recognition system for Slovene UMB Broadcast News developed at the Faculty of Electrical Engineering and Computer Science of the University of Maribor.

As in the previous versions of the Gos VideoLectures database, the transcriptions were converted from the Transcriber 1.5.1 XML format to metadata files in TEI (module for speech corpora). The conversion to TEI includes the list of speakers with their metadata, metadata about speeches, co-alignment of utterances and sentences with the speech signal, coding of acoustic events and alignment between conversational and standardized transcription. Additionally, words in the standardized transcription were automatically lemmatized and tagged with MSDs. Along with the conversion, the source files were validated and a number of errors were detected and corrected. Based on the TEI files, we created a vertical file that is needed for the import of the database into CLARIN.SI concordancers. Audio files were also adapted for the import into concordancers so that it is now possible to listen to the recordings while searching through the Gos VideoLectures corpus.
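As a rough illustration of what such a vertical file looks like, the sketch below writes tokens in the general vertical layout read by (no)Sketch Engine: one token per line with tab-separated attributes, and XML-like tags for structures. It is not the conversion used in the project; the function name, the choice of attributes (word, lemma, MSD), the document identifier and the tag values are assumptions made for the example.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Serialise one document to vertical format.

    sentences: list of sentences, each a list of (word, lemma, msd) tuples.
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, lemma, msd in sentence:
            lines.append(f"{word}\t{lemma}\t{msd}")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Hypothetical document with one sentence; the MSD tags are illustrative only.
vertical = to_vertical(
    "lecture-001",
    [[("Dober", "dober", "Agpmsnn"), ("dan", "dan", "Ncmsn")]],
)
print(vertical, end="")
```

The actual corpus carries further structures and attributes (speakers, utterances, alignment with audio); these would be added as extra tags and columns in the same layout.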

The corpus is available through CLARIN.SI concordancers and for download:

Verdonik, Darinka et al., 2019, Spoken corpus Gos VideoLectures 4.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1223.
VideoLectures.NET, 2019, Spoken corpus Gos VideoLectures 4.0 (audio), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1222.

The multimedia database of the dictionary of the clothing terminology of the Zilja local dialect of Canale Valley (Val Canale – Kanaltal – Valcjanâl)

Applicant: Karmen Kenda-Jež, ZRC SAZU
Funds awarded: € 4,000

The multimedia database for The Dictionary of the Clothing terminology of the Zilja Local dialect of Canale Valley, published on the FRAN portal, was created from the collection of dialect material previously used for two printed editions of the dictionary. The transfer to the digital environment resulted in the formal adaptation to the new media (e.g. the manner of data presentation, replacement of the abbreviations with their full form or the unification of grammatical qualifiers) and in the range of microstructural changes that were caused by the self-contained presentation of the online dictionary entry and its direct links to the sound clip collection. The final version of the online dictionary is therefore substantially different from its printed versions.

The dictionary, which contains 594 entries, was transformed from the Word format to the dictionary database in XML and equipped with the intra- and inter-dictionary links. The original collection of sound clips was revised. Sound clips of lower quality (e.g. those with overlapping speech) were eliminated. If possible, the new sound material was gathered with additional analysis of previously used soundtracks. The sound clips were linked with dialect lemmas and examples.

Selected photographic material from the ethnographic research archive of clothing culture has been added to the database. For some of the entries, a connection has been established with the ethnographic online collection Glasovi Kanalske doline (The Voices of Canale Valley) of the Zborzbirk project Kulturna dediščina v zbirkah med Alpami in Krasom (Cultural heritage in the collections between the Alps and the Karst). The Fran portal gives access to the monographs on the local Ziljsko dialect: Ovčja vas in njena slovenska govorica (Ovčja vas and its Slovenian Speech), 2005, and Lipalja vas in njena slovenska govorica (Lipalja vas and its Slovenian Speech), 2016, both provided in open access as part of the project.

The database is available at:

Kenda-Jež, Karmen; Perdih, Andrej and Race, Duša, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1217.
Gliha Komac, Nataša; Kandutsch, Elisa; Bartaloth, Rudi and Smole, Matevž, 2019, The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): photographs, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1221.
Kenda-Jež, Karmen, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): audio, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1220.

Projects in which CLARIN.SI participates

CLARIN ParlaMint

CLARIN.SI (Jožef Stefan Institute and Institute of Contemporary History) participated in the project “ParlaMint: Towards Comparable Parliamentary Corpora” (phase I: 2020-2021, phase II: 2022-2023), financed by CLARIN ERIC.

In the first phase of the project we developed comparable corpora of 17 European national parliamentary debates 2015-2020, while the second phase developed corpora for 29 parliaments 2015-2022 and included other extensions, e.g. the corpora machine translated to English. Samples are available on GitHub, while the complete corpora from phase II can be downloaded from the CLARIN.SI repository, in three versions:

Erjavec, Tomaž et al., 2023, Multilingual comparable corpora of parliamentary debates ParlaMint 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1859.
Erjavec, Tomaž et al., 2023, Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1860.
Kuzman, Taja et al., 2023, Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1864.

The background, corpus creation process and the corpora produced in the first phase of the project are described in:

Erjavec, T., Ogrodniczuk, M., Osenova, P. et al. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation (2022). https://doi.org/10.1007/s10579-021-09574-0.

Upgrade of CLARIN.SI: Corpus summarizer and text analyzer (SLOKIT)

CLARIN.SI (Jožef Stefan Institute) participated in the CLARIN.SI Upgrade: Corpus Summarizer and Text Analyzer project, which was funded by the Ministry of Culture of the Republic of Slovenia in 2022-2023. The project partner was the Slovenian Association of Disabled Students, and infrastructure support was provided by the Centre for Language Resources and Technologies of the University of Ljubljana.

The following activities were carried out within the project:

Development of the Korpusnik tool, which summarizes data from various Slovenian corpora.
Segmentation of Delo and Dnevnik newspaper texts in the Gigafida corpus into individual articles, and annotation of the segmented articles with thematic categories. The results will be implemented in the next version of the Gigafida corpus, namely 2.2.
Upgrading of data in the Gos reference corpus of spoken Slovene, especially at the level of morphosyntactic annotation, and alignment of transcriptions with sound recordings. The new version of the corpus, Gos 2.1, is available in the CLARIN.SI repository.
Development of the SENTA tool for text simplification and analysis.

When developing the Korpusnik and SENTA tools, special attention was paid to accessibility for users with special needs.

Development of Slovene in the Digital Environment (RSDO)

CLARIN.SI participates in the project “Development of Slovene in a Digital Environment” (2020-2022), which is supported by the Ministry of Culture of the Republic of Slovenia. The project aims to satisfy the increasing need for services, tools and language resources in the field of language technologies for the Slovenian language. The products are aimed at research organizations, companies and the general public.

As part of the project, CLARIN.SI leads the sixth work package, “Maintenance of an infrastructure centre for language resources and technologies”. Within the work package, CLARIN.SI takes care of the public availability of the language resources created within the project. International standards and good practices are taken into account when a language resource is created, and the published resources will be safely archived in the CLARIN.SI repository, with the produced corpora also available through the online CLARIN.SI concordancers.

Development of RI-SI-CLARIN

The project “Development of research infrastructure for the international competitiveness of Slovene RRI space RI-SI-CLARIN” was carried out within the Operational Program for the Implementation of European Cohesion Policy in the period 2014-2020, specifically in 2019-2021. The project financed new equipment for the infrastructure and was funded in the amount of € 477,932.82 incl. tax.

As part of the project, the following purchases of research equipment have been carried out:

Jožef Stefan Institute: 2 clusters of high-performance computers with associated equipment for the purposes of faster and fault-tolerant CLARIN.SI web services, mainly the repository platform, web concordancers and services for automatic linguistic text annotation;
University of Ljubljana: a high-performance server for storing and accessing language resources, managed by the Centre for Language Resources and Technologies of the University of Ljubljana;
University of Maribor: a cluster of GPU servers in 2019 and its upgrade in 2021, intended for research that uses deep learning; high-performance servers for processing large language data; and the work of a technician.

RDA Node Slovenia

In 2020-2021, CLARIN.SI participated in the establishment of the national RDA Node, “RDA Node Slovenia”, which acts as a long-term central contact point between the Research Data Alliance and data practitioners. RDA Node Slovenia is coordinated by the Slovenian Social Science Data Archives. The RDA Node Slovenia data community was initially composed of representatives of the Humanities (DARIAH-SI) and Linguistics (CLARIN.SI) research data infrastructures, and the University of Ljubljana.

CLARIN.SI, within the working group titled “Coordination of infrastructure data services”, produced the following overview of Slovenian data repositories:

Meden, K., & Erjavec, T. (2021). Pregled Slovenskih repozitorijev raziskovalnih podatkov (Overview of Slovenian research data repositories). CLARIN.SI. [PDF] [DOCX]