Projects in which CLARIN.SI participates
CLARIN.SI (Jožef Stefan Institute and Institute of Contemporary History) participates in the project “ParlaMint: Towards Comparable Parliamentary Corpora” (Phase 1: 2020-2021, Phase 2: 2022-2023), financed by CLARIN ERIC.
In the first phase of the project we developed comparable corpora of 17 European national parliamentary debates 2015-2020. Samples are available on GitHub, while the complete corpora from phase 1 can be downloaded from the CLARIN.SI repository, in two versions:
- Erjavec, Tomaž; et al., 2021, Multilingual comparable corpora of parliamentary debates ParlaMint 2.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1432.
- Erjavec, Tomaž; et al., 2021, Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.1, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1431.
The background, corpus creation process and the corpora produced in the first phase of the project are described in:
- Erjavec, T., Ogrodniczuk, M., Osenova, P. et al. The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation (2022). https://doi.org/10.1007/s10579-021-09574-0.
We are now working on extending the ParlaMint corpora with newer proceedings, and with new countries, languages, and modalities, cf. the CLARIN ERIC ParlaMint project description.
Development of Slovene in the Digital Environment (RSDO)
CLARIN.SI participates in the project “Development of Slovene in a Digital Environment” (2020-2022), which is supported by the Ministry of Culture of the Republic of Slovenia. The project aims to satisfy the increasing need for services, tools and language resources in the field of language technologies for the Slovenian language. The products are aimed at research organizations, companies and the general public.
As part of the project, CLARIN.SI leads the sixth work package, “Maintenance of an infrastructure centre for language resources and technologies”. Within the work package, CLARIN.SI takes care of the public availability of the language resources created within the project. International standards and good practices are taken into account when creating a language resource, and the published resources will be safely archived in the CLARIN.SI repository, with the produced corpora also available through the online CLARIN.SI concordancers.
RDA Node Slovenia
CLARIN.SI in 2020-2021 participated in the establishment of the national RDA Node, “RDA Node Slovenia”, which acts as a long-term central contact point between the Research Data Alliance and data practitioners. RDA Node Slovenia is coordinated by the Slovenian Social Science Data Archives. The RDA Node Slovenia data community was initially composed of the representatives of the Humanities (DARIAH-SI) and Linguistics (CLARIN.SI) research data infrastructures, and the University of Ljubljana.
CLARIN.SI, within the working group titled “Coordination of infrastructure data services”, produced the following overview of Slovenian data repositories:
- Meden, K., & Erjavec, T. (2021). Pregled Slovenskih repozitorijev raziskovalnih podatkov. CLARIN.SI. [PDF] [DOCX]
Development of RI-SI-CLARIN
The project “Development of research infrastructure for the international competitiveness of Slovene RRI space RI-SI-CLARIN” was carried out within the Operational Program for the Implementation of European Cohesion Policy in the period 2014-2020, specifically in 2019-2021. The project financed new equipment for the infrastructure and was funded in the amount of € 477,932.82 incl. tax.
As part of the project, the following purchases of research equipment have been carried out:
- Jožef Stefan Institute: 2 clusters of high-performance computers with associated equipment for the purposes of faster and fault-tolerant CLARIN.SI web services, mainly the repository platform, web concordancers and services for automatic linguistic text annotation;
- University of Ljubljana: a high-performance server for storing and accessing language resources, managed by the Centre for Language Resources and Technologies of the University of Ljubljana;
- University of Maribor: a cluster of GPU servers in 2019 and its upgrade in 2021, intended for research that uses deep learning; high-performance servers for processing large language data; and the work of a technician.
Projects supported by CLARIN.SI
Since 2018, CLARIN.SI has annually published a tender for developing new resources or services, upgrading existing ones, or other work that furthers the CLARIN(.SI) strategy. CLARIN.SI allocates € 30,000 per year for the implementation of projects.
CLARIN.SI project reports 2022
In 2022 CLARIN.SI accepted six projects for funding. The projects completed so far are described below:
Online service for advanced querying of Slovenian Universal Dependencies treebanks
Project leader: Kaja Dobrovoljc, FF UL
Developer: Miha Štravs, FRI UL student
Funding: 5,000 €
Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service aimed at linguists and other researchers that enables querying syntactically parsed corpora of Slovenian through an easy-to-use query language on the one hand and user-friendly graph visualisations on the other. It is based on the open-source dep_search tool, which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results and filtering by sentence length. The query language is explained on a separate help page containing several illustrative examples and enables users to search for a wide spectrum of grammatical phenomena, from single words to complex syntactic structures. The results are displayed as dependency trees (graphs) and can also be downloaded in various formats. Currently, three corpora are available for browsing: the manually annotated reference treebanks of written (SSJ) and spoken Slovenian (SST), and the automatically parsed ccKres corpus. New corpora in the CoNLL-U format can also be added.
The source code and the documentation for the search backend and the web user interface are publicly available in the CLARIN.SI GitHub repository drevesnik.
The KUUS corpus of textbooks for learning Slovenian as a second and foreign language and basic vocabulary lists for levels A1, A2 and B1
Project leader: Matej Klemen, FF UL
Other collaborators: Špela Arhar Holdt, FF+FRI UL, Damjan Huber, FF UL, Iztok Kosem, FF+FRI UL, Mateja Lutar, FF UL, Senja Pollak, IJS
Funding: 2,500 €
The project developed a corpus of textbooks for learning Slovenian as a second and foreign language, KUUS, and, by analyzing its vocabulary, a core vocabulary list for levels A1, A2 and B1 according to the Common European Framework of Reference for Languages (CEFR). The KUUS corpus comprises 17 textbooks for Slovenian as a second and foreign language published by the Centre for Slovene as a Second and Foreign Language, which are currently widely used in teaching Slovenian as a second and foreign language to children, adolescents and adults in Slovenia and abroad. The KUUS corpus consists of 520,796 words. It is annotated and includes relevant metadata about the textbooks.
Word lists for each level of language proficiency have a long tradition in foreign language learning. For Slovenian as a second and foreign language, they have been included in various ways in language documents, e.g. in the Preživetvena raven za slovenščino (Survival Level for Slovenian; Pirih Svetina et al. 2004), the Sporazumevalni prag za slovenščino (Threshold Level for the Slovenian Language; Ferbežar et al. 2004), etc., and have been prepared as a consensus of the authors of the individual documents. In the project, however, we have prepared a list based on a corpus approach, combining vocabulary for different levels in one document.
We exported words or lemmas from the KUUS corpus and defined robust numerical criteria, which were used to assign each word a CEFR level label: A1-core, A1-broader, A2, B1, etc. We checked whether each word appeared in the textbooks as well as in the Reference List of Slovene Frequent Common Words (Pollak et al. 2020, http://hdl.handle.net/11356/1346). Words that were tagged A1, A2 or B1 but were not part of the Reference List of Slovene Frequent Common Words were manually reviewed and categorised by content. A certain proportion of these words were identified as relevant candidates for inclusion in the core vocabulary lists marked with CEFR levels: for instance, we included typical textbook linguistic terminology (e.g. poved, pogojnik, modalen). Thus, the list of core vocabulary in the current version consists of 350 words tagged A1-core, 864 words tagged A1-broader, 1451 words tagged A2, and 2608 words at B1 level; 5273 words in total.
The results of the project are available in the CLARIN.SI repository in two entries under the ACA ID-BY-NC-INF-NORED 1.0 licence:
- Corpus of textbooks for learning Slovenian as L2 KUUS 1.0: http://hdl.handle.net/11356/1696
- Core vocabulary for Slovenian as L2 1.0: http://hdl.handle.net/11356/1697
The preparation and compilation of the corpus and the production of the core vocabulary list for levels A1, A2 and B1 according to CEFR are presented in more detail in the paper:
KLEMEN, Matej, ARHAR HOLDT, Špela, POLLAK, Senja, KOSEM, Iztok, HUBER, Damjan, LUTAR, Mateja, 2022: Korpus učbenikov za učenje slovenščine kot drugega in tujega jezika. Nataša Pirih Svetina, Ina Ferbežar (eds.): Na stičišču svetov: slovenščina kot drugi in tuji jezik. Obdobja 41. Ljubljana: Založba Univerze v Ljubljani. 165–174. DOI: https://doi.org/10.4312/Obdobja.41.2784-7152.
Parallel Corpus of Idiomatic Texts ParaDiom
Project leader: Gregor Donaj, UM FERI
Other collaborators: Špela Antloga, UM FERI
Funding: 6,000 EUR
ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translations and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus.
Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching through sentences in the corpora ccGigafida, ParlaMint, and The Corpus of Late Modern English Texts. All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features and lemmas, using Stanza for English and CLASSLA for Slovene sentences. Some idioms were found as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand, and the other half were translated using machine translation and post-editing. We used the Q-CAT annotation tool to annotate the idiomatic expressions. The annotated noun, adjective and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’).
The results of the project are available in the CLARIN.SI repository under the CC BY-NC-SA 4.0 license:
- Parallel corpus of idiomatic text ParaDiom 1.0; http://hdl.handle.net/11356/1714.
Compilation of the SI-NLI Slovene Natural Language Inference Dataset
Project leader: Matej Klemen, UL FRI
Other collaborators: Aleš Žagar, UL FRI, Jaka Čibej, UL CJVT, Marko Robnik-Šikonja, UL FRI
Funding: 10,000 EUR
SI-NLI (Slovene Natural Language Inference Dataset) is a dataset that enables training models to identify natural language inference relations for a set of sentence pairs. For instance, the premise “Med bregoma teče pet metrov široka reka.” (A five-meter-wide river flows between the two banks.) and the hypothesis “Skočil sem z enega na drugi breg.” (I jumped from one riverbank to another.) are annotated as a contradiction (because it is physically impossible for a human to make a jump that long). The dataset was compiled using sentences that occur in Slovene reference corpora. The main goal of the compilation process was to generate varied and diverse examples. An overview of related datasets for English revealed that many of them contain simple examples, which can cause language models to rely on surface-level features instead of logical inference. The final dataset contains 5,937 sentence pairs and is divided into a training set (4,392 examples), a validation set (547 examples), and a testing set (998 examples). The division of examples was done using Slovene BERT-type language models to ensure that both simple and complex examples are uniformly distributed among all three subsets.
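The idea of distributing simple and complex examples evenly across the three subsets can be illustrated with a stratified split. The following is only a minimal sketch, not the project's procedure: the `difficulty` field, the function name and the toy data are hypothetical, and the fractions merely approximate the reported subset sizes (4,392 / 547 / 998 out of 5,937).

```python
import random
from collections import defaultdict


def stratified_split(examples, fractions=(0.74, 0.09, 0.17), seed=0):
    """Split examples into train/validation/test so that every difficulty
    stratum is divided in roughly the same proportions."""
    by_stratum = defaultdict(list)
    for example in examples:
        by_stratum[example["difficulty"]].append(example)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for stratum in sorted(by_stratum):
        items = by_stratum[stratum]
        rng.shuffle(items)
        n_train = round(len(items) * fractions[0])
        n_valid = round(len(items) * fractions[1])
        train += items[:n_train]
        valid += items[n_train:n_train + n_valid]
        test += items[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Toy data with hypothetical difficulty labels: 60 simple, 40 complex examples.
examples = [
    {"id": i, "difficulty": "simple" if i < 60 else "complex"}
    for i in range(100)
]
train, valid, test = stratified_split(examples)
```

In this sketch the difficulty label is given; in the project it was derived with Slovene BERT-type models, in a way the report does not detail.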
The dataset was compiled using a semi-automatic approach consisting of two steps. In the first step, candidate sentence pairs (i.e. a premise and a hypothesis) were extracted from the ccKres 1.0 Reference Corpus of Slovene using a neural sentence encoder. In the second step, the annotators were tasked with modifying the suggested hypothesis (or generating a new hypothesis) for each premise and each language inference relation: entailment (E), neutrality (N), and contradiction (C). The process was described in more detail in guidelines designed to ensure that the generated examples were as diverse as possible. For instance, the guidelines state that simple negation is unsuitable for generating examples for contradiction (C). Every example was annotated by at least two annotators, with some examples additionally annotated by a third annotator. The SI-NLI dataset thus enables a thorough analysis of the inference capabilities of Slovene language models, and because particular attention was given to the process of generating diverse examples, the dataset is a methodological improvement not only for Slovene language resources, but in general.
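The first step, proposing candidate pairs with a sentence encoder, can be sketched as a nearest-neighbour search over sentence embeddings. This is an illustration under assumptions: the report does not state which similarity measure was used (cosine similarity is assumed here), the toy vectors stand in for real encoder output, and the function names are hypothetical. The actual extraction code is in the project's GitHub repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_candidates(embeddings):
    """For each sentence index, return the index of the most similar
    other sentence, to be offered as a candidate hypothesis."""
    pairs = {}
    for i, u in enumerate(embeddings):
        scores = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        pairs[i] = max(scores)[1]
    return pairs


# Toy 3-dimensional "embeddings" (assumed; a real encoder yields hundreds of dimensions).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
]
candidates = best_candidates(toy_embeddings)
```

The annotators then rewrite or replace the suggested hypothesis, so the automatic step only needs to propose plausible starting points.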
The results of the project are available as follows:
- The source code for extracting candidate sentence pairs for further annotation and training natural language inference models is available at https://github.com/clarinsi/si-nli.
- The dataset is available at the CLARIN.SI repository: Slovene Natural Language Inference Dataset SI-NLI, http://hdl.handle.net/11356/1707. Because it is also implemented in the SloBENCH evaluation framework (https://slobench.cjvt.si/), the test set labels are private.
- We have trained two Slovene natural language inference models on the compiled dataset: a monolingual SloBERTa model (which achieves an accuracy of 73.5%) and a multilingual CroSloEngual model (with an accuracy of 67.3%). The models are publicly available in the HuggingFace model repository at https://huggingface.co/cjvt/sloberta-si-nli, https://huggingface.co/cjvt/crosloengual-bert-si-nli.
Compiling a corpus of political party programmes for the 2022 Parliamentary Election
Project leader: Andrej Pančur, INZ
Other collaborators: Petra Polanič, Filip Dobranić, INZ
Funding requested: 2,500 EUR
The corpus includes programmes used by political parties to participate in the 2022 Slovenian Parliamentary Election on April 24, 2022. The programmes were included as they were published on the parties’ websites up until the day before the election.
The text of each party’s programme was stored in a separate file, with the exception of the parties Naša prihodnost and Dobra država, which ran together in the election and were thus treated together, i.e. are stored in one file. Each file was first converted into .txt format, which is unchanged with respect to the original except for some specific elements that were excluded from all the programmes they appeared in: introductions of the programme by party leaders, tables of contents, names of candidates and the districts they ran in, and longer quotations (for example from other party documents or the party congress). The text of the programme was examined and cleaned of parts such as text that appeared twice, headers and footers, text that was part of a figure, and descriptions and sources of photos. In the case of two parties that published their programmes as text on their websites, the corpus doesn’t include unfinished sections of the programmes (explicit statements that the section is still being edited, or sample text). The text of the programmes was annotated using the CLASSLA tool and converted into the CoNLL-U format.
The 19 programmes consist of a total of 330,559 tokens. The shortest programme in the corpus is the programme by the Lista Borisa Popoviča party (264 tokens) and the longest is the Socialni demokrati party programme (67,071 tokens). The metadata features the party name, its URL and the URL of the programme.
The corpus allows us to examine the programmes and compare their content based on linguistic features (for example by comparing the most frequent adjectives in individual programmes, the presence and frequencies of specific words and phrases and so on). There are considerable differences among parties in their presentation of their programmes; from obvious differences in length to different formats and visual elements (such as the use of graphs, photos, charts). Systematic ways of documenting such elements could be valuable in upgrading this corpus or developing similar corpora in the future. Additionally, the linguistic and content analysis of political party programmes could greatly benefit from larger corpora that include programmes produced over a longer period of time, allowing for comparison within one party through the years, among the most frequently mentioned topics of each election, and so on.
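As an illustration of the kind of comparison mentioned above, the most frequent adjectives can be counted directly from a CoNLL-U file. The following is a minimal sketch: the function name and the sample sentence are invented for the example and are not taken from the corpus; only the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, ...) is relied upon.

```python
from collections import Counter


def adjective_lemmas(conllu_text):
    """Count the lemmas of tokens whose UPOS tag is ADJ in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        # Skip malformed lines, multiword-token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[3] == "ADJ":
            counts[cols[2].lower()] += 1
    return counts


# Invented sample: ID, FORM, LEMMA, UPOS; the remaining six columns are left as "_".
rows = [
    ("1", "Zelena", "zelen", "ADJ"),
    ("2", "prihodnost", "prihodnost", "NOUN"),
    ("3", "in", "in", "CCONJ"),
    ("4", "zelena", "zelen", "ADJ"),
    ("5", "energija", "energija", "NOUN"),
]
sample = "# text = Zelena prihodnost in zelena energija\n" + "\n".join(
    "\t".join(row + ("_",) * 6) for row in rows
)
print(adjective_lemmas(sample).most_common(3))
```

Running the function over each programme file separately and comparing the resulting lists gives a first, rough view of the differences between parties.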
The results of the project are available in the CLARIN.SI repository under the CC BY-NC-SA 4.0 license:
- Corpus of political party programs Programi2022; http://hdl.handle.net/11356/1734.
Automatic speech recognition test dataset for SloBench platform
Project leader: Darinka Verdonik, UM FERI
Other collaborators: Andreja Bizjak, Simona Majhenič, UM FERI
Funding: 6,000 EUR
In 2021, the evaluation platform SloBench (https://slobench.cjvt.si/) was established. Its goal is to enable independent evaluation of language technology tools for the Slovenian language. Evaluation data are hidden. In the SloBench ASR project we have prepared the test dataset for speech recognition evaluation for the Slovenian language. The data includes recordings and speakers which are, to our knowledge, not present in the available speech databases for the Slovenian language. The data is structured as follows:
- 15 recordings with a total duration of 3 h 18 min 28 s (3:18:28)
- public speech with a total duration of 2:08:35 and private speech with a total duration of 1:09:53
- 9 recordings with a total duration of 2:03:04 from the south-western part of Slovenia and 6 recordings with a total duration of 1:15:24 from the north-eastern part of Slovenia
- 18 male speakers and 19 female speakers
- public speech covers the topics evolution, description of a settlement, scientific slam, description of a life, culture of speech, news, books, and energetics; private speech includes 4 monologues and 3 dialogues between two persons
- the private speech recordings feature 10 speakers: 3 are up to 30 years old, 5 are between 30 and 49 years old, and 2 are over 50 years old
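The duration breakdowns listed above can be checked against the stated total with a few lines of arithmetic. This is just a small consistency check; the helper names are our own.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert a number of seconds back to an 'h:mm:ss' string."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total = "3:18:28"
by_speech_type = ["2:08:35", "1:09:53"]  # public, private
by_region = ["2:03:04", "1:15:24"]       # south-western, north-eastern

for name, parts in (("speech type", by_speech_type), ("region", by_region)):
    summed = to_hms(sum(to_seconds(p) for p in parts))
    print(f"by {name}: {summed} (stated total: {total})")
```

Both breakdowns add up to the stated 3:18:28.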
All of the recordings are manually transcribed in two modes, the colloquial transcription and the standardised transcription (Verdonik et al. 2013), following the same standards as those used for the transcription of the Artur speech database for ASR, developed in the ‘Development of Slovene in a Digital Environment’ project and available on the CLARIN.SI repository.
The recordings of the SloBench speech recognition test dataset are available at https://slobench.cjvt.si/leaderboard/view/10. The transcriptions are used to perform the evaluation. The results of the evaluation are published at https://slobench.cjvt.si/leaderboard/view/10.
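The report does not name the metric used on the leaderboard. Word error rate (WER) is a common choice for evaluating speech recognition and is assumed here purely for illustration; the sketch below computes it as the word-level edit distance between a reference transcription and a system output, divided by the reference length. The function name and the example sentences are our own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of four reference words.
print(word_error_rate("dober dan vsem skupaj", "dober dan skupaj"))
```

Since the dataset provides both colloquial and standardised transcriptions, the score depends on which of the two is taken as the reference.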
CLARIN.SI project reports 2021
In 2021 CLARIN.SI accepted four projects for funding; however, only three were successfully completed. They are described below.
Extractions from KAS corpus
Applicants: Aleš Žagar, Matic Kavaš and Marko Robnik-Šikonja, University of Ljubljana, Faculty of Computer and Information Science
Funds awarded: 9,500 €
The Corpus of Academic Slovene KAS 1.0 (http://hdl.handle.net/11356/1244) contains BSc, MSc, and PhD theses from the Slovene open science portal in the amount of approximately 82,000 documents, and there also exists a separate repository entry (http://hdl.handle.net/11356/1420) with the abstracts from the KAS corpus. The analysis of the KAS corpus showed that many documents are unsatisfactorily extracted and structured. The inconsistencies we detected are, e.g. mixed Slovenian and English abstracts and keywords, absence of abstracts, other texts instead of abstracts, non-segmented texts, non-existent text classifications, noisy extraction of some text elements, etc. So far, datasets extracted from the corpus have not included summaries nor exploited the coexistence of English and Slovenian abstracts for machine translation.
The project produced a cleaner version of the KAS corpus with added segmentation into chapters, and updated its PoS-tagging. The updated corpus of abstracts contains less noise, and its abstracts are labelled by language. We extracted approximately 72,000 Slovenian and 54,000 English abstracts. Using machine learning models, we improved the metadata, supplementing about half of the missing information on the CERIF research areas. From the extracted texts and summaries we created several new datasets: a monolingual (72,000 examples) and a cross-lingual dataset (54,000 examples) for summarizing long academic texts, and a dataset of aligned sentences from summaries in English and Slovene suitable for training or evaluating machine translation systems. We created three versions of the machine translation dataset with different reliability of alignments: the default alignment contains approximately 497 thousand pairs, the more reliable alignment 475 thousand, and the highly reliable alignment 426 thousand translation pairs.
The program code is available in the repository https://github.com/korpus-kas. With the program code it is possible to extract texts and abstracts, build models for the classification of research areas of individual works, and align sentences of abstracts written in English and Slovenian.
The corpora and datasets are published in the CLARIN.SI repository:
- Corpus of Academic Slovene KAS 2.0: http://hdl.handle.net/11356/1448
- Abstracts from the KAS corpus KAS-Abs 2.0: http://hdl.handle.net/11356/1449
- Summarization datasets from the KAS corpus KAS-Sum 1.0: http://hdl.handle.net/11356/1446
- Machine Translation datasets from the KAS corpus KAS-MT 1.0: http://hdl.handle.net/11356/1447
We describe the procedures for extracting and preparing the datasets in the paper:
Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society – IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
SloBENCH: Design and implementation of an evaluation framework for language technologies
Applicants: Slavko Žitnik, Simon Krek, Marko Robnik-Šikonja and Frenk Dragar, University of Ljubljana, Faculty of Computer and Information Science
Funds awarded: 10,000 €
There are a number of tasks that are important for the development of the natural language processing of a specific language. Examples of such tasks are automatic summarization, translation, part-of-speech tagging and information extraction techniques. Language resources and technologies are available through various platforms (e.g., the CLARIN.SI repository) but their objective comparison is not done end-to-end or uniformly.
The results of this project improve the overview and transparency of the landscape of tools and resources developed for the Slovenian language. The SloBENCH tool is a web portal containing publicly available leaderboards for arbitrary natural language processing tasks. It supports multiple user roles for adding, editing and creating new leaderboard versions. Web services implement the automatic evaluation and the calculation of benchmarking scores specific to each leaderboard. The evaluation tools that are part of SloBENCH are published and maintained in the public CLARIN.SI source code repository. To simplify testing, each evaluation tool can be run separately by anyone interested in how evaluation is done within SloBENCH.
The initial version of SloBENCH contains evaluation scripts with examples of training and testing datasets for nine different tasks: named entity recognition, part-of-speech tagging, lemmatization, dependency parsing, semantic role labelling, translation (ENG-SLO, SLO-ENG), summarization and question answering.
After the end of the project, the maintenance will be performed by CJVT. Apart from the internal source code repository of the SloBENCH portal and its documentation within CJVT, the project provides the following public resources:
- Portal https://slobench.cjvt.si: Main public access to all available leaderboards
- Evaluation framework: https://github.com/clarinsi/slobench-eval-docker
- Public DockerHub repository with pre-built Docker images, used by SloBENCH: https://hub.docker.com/r/slobench/eval/tags.
Corpus of metaphorical expressions in spoken Slovene language G-KOMET
Applicant: Špela Antloga, Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor
Funds awarded: 6,000 €
G-KOMET (a corpus of metaphorical expressions in spoken Slovene language) is an upgrade of the hand-annotated written corpus for metaphorical expressions KOMET 1.0 with transcriptions of speech and conversation, covering 50,000 lexical units. The corpus includes a balanced set of transcriptions of informative, educational, entertaining, private, and public discourse. It contains hand-annotated metaphor-related words, i.e. linguistic expressions that have the potential for people to interpret them as metaphors; idioms, i.e. multi-word units in which at least one word has been used metaphorically; and metonymies, i.e. expressions that we use to express something else.
The annotation scheme was based on the MIPVU metaphor identification procedure. This protocol was modified and adapted to the specifics of the Slovene language and of spoken language. The corpus was annotated for the following relations to metaphor: indirect metaphor, direct metaphor, borderline cases and metaphor signals. In addition, the corpus introduces a new ‘frame’ tag, which gives information about the concept to which an expression refers. This conceptual frame allows us to search for figurative expressions within a specific context category (e.g. time, spatial orientation, emotions etc.). Metonymies were furthermore categorized based on the specific metonymic mapping. The G-KOMET corpus allows an objective and systematic analysis of metaphorical expressions, metaphors and metonymies in various Slovene texts.
The corpus is published in the CLARIN.SI repository:
- Corpus of metaphorical expressions in spoken Slovene language G-KOMET 1.0: http://hdl.handle.net/11356/1490.
CLARIN.SI project reports 2020
In 2020 CLARIN.SI received fewer project proposals than in previous years, to a large extent due to the intensive involvement of almost all consortium members in the RSDO project. Three projects were selected for funding; however, one of them did not start due to copyright problems with its data. The two successfully concluded projects are described below.
Tutorial on the siParl 2.0 corpus: Voices of the parliament – Corpus linguistics approach to parliamentary discourse
Applicant: Kristina Pahor de Maiti, Faculty of Arts, University of Ljubljana
Funds awarded: € 5,000
CLARIN.SI and Slovene researchers have immensely contributed towards the development of parliamentary corpora and an improved understanding of the potential of parliamentary corpora for the researchers in the European scientific community (via the development of annotation recommendations and parliamentary corpora for several languages, an overview of available parliamentary corpora, organization of content-related international scientific events). However, in the Slovene scientific community, this potential has not yet been fully recognized nor exploited, which is why we used this project to create a tutorial that could help close this gap.
The aim of this project was to create a user-friendly, methodologically sound and research-relevant tutorial that demonstrates the potential of linguistic corpora for the analysis of socio-cultural phenomena through language use in specialized discourse. The tutorial is based on the siParl 2.0 corpus (http://hdl.handle.net/11356/1300) which contains records of Slovene parliamentary debates from 1990–2018, while the analytical tool used was the CLARIN.SI noSketch Engine concordancer (https://www.clarin.si/noske/), i.e. the siParl 2.0 corpus available through this tool.
The tutorial starts with a brief theoretical introduction covering the peculiarities of the specialized political discourse and the effect of gender on communication practices as well as providing an explanation of the most popular corpus analysis techniques. The main part of the tutorial comprises three tasks in which we use various corpus analysis techniques to better understand women’s position in the Slovene parliament. We adopt a step-by-step approach in order to guide the reader from formulating queries and analytical procedures to the interpretation of the results. In addition, we supply screencasts for each task that demonstrate the use of a concordancer which helps the user in carrying out the showcased procedures independently.
While the tutorial uses Slovene corpus data, the analyses demonstrated in the tutorial can also be performed on similar parliamentary corpora in other languages as well as generalized to investigate other types of linguistic corpora. On the one hand, this encourages international comparison of parliamentary culture and discourse, and on the other hand, promotes cross-disciplinary exchange of methodological approaches. In order to reach the international audience, there is a Slovene and English version of the tutorial available.
The tutorial is available in both Slovene and English in the digital library of the Institute of Contemporary History:
- FIŠER, Darja, PAHOR DE MAITI, Kristina. Voices of the parliament: “First, I’m a Female Politician, Not a Male One, and Second …”: a corpus approach to parliamentary discourse research. Institute of Contemporary History, 2021. ISBN 978-961-7104-06-6. https://sidih.github.io/voices/index.html.
The compilation of the MEMIS epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia
Applicant: Gregor Pobežin, Institute for Cultural History ZRC SAZU
Funds awarded: € 4,000
In the project “Epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia MEMIS 1.0”, 51 Latin inscriptions from the Medieval and Early Modern period (dating between 1222 and the mid-17th century) were collected, catalogued, processed in XML and translated. In its present extent, the corpus contains only the inscriptions from the Slovenian coastal towns, particularly Koper and Piran. It covers both the inscriptions still present in their primary context and those that were at some point moved or even destroyed and are therefore only available in transcripts. The material for the corpus was collected during field research by collecting the inscriptions in situ.
The inscriptions contained in the corpus are fully expanded (i.e. their abbreviations and ligatures are resolved), commented and translated, and carry most of the relevant epigraphic metadata; for this purpose, the EpiDoc template XML file was used, which facilitates the processing of metadata.
The purpose of the corpus is to create a methodological basis for the processing of medieval and early modern inscriptions in Latin (and in vernacular languages) which are located or were discovered in the area of the Slovenian ethnic territory. The corpus, which will be published as an integrated source within the DARIAH.SI infrastructure, will enable the systematic processing and publication of a rich (written) cultural heritage, which, unlike Roman-era inscriptions, has not been addressed thus far in a scientific manner.
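To illustrate what expanded abbreviations look like in EpiDoc-style markup, the sketch below reads a tiny invented snippet (not taken from MEMIS) in which an abbreviation is encoded with the TEI elements `expan`, `abbr` and `ex`, and derives both the expanded reading and the abbreviated form. Real EpiDoc files place these elements in the TEI namespace, which is omitted here for brevity; the function names are our own.

```python
import xml.etree.ElementTree as ET

# Invented example: "D" is the abbreviation on the stone, "omini" is supplied by the editor.
SNIPPET = "<ab>Anno <expan><abbr>D</abbr><ex>omini</ex></expan> MCCXXII</ab>"


def expanded_text(elem):
    """Full reading: abbreviation letters plus editorially supplied letters."""
    return "".join(elem.itertext())


def abbreviated_text(elem):
    """Reading as written on the object: everything inside <ex> is dropped."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "ex":
            parts.append(abbreviated_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SNIPPET)
print(expanded_text(root))     # expanded reading
print(abbreviated_text(root))  # abbreviated form
```

Keeping both layers in one file is what lets a single encoding serve a diplomatic and an interpretive edition.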
Epigraphic corpus MEMIS 1.0 is available under the CC BY-NC-SA 4.0 licence at:
- Pobežin, Gregor, 2020, Epigraphic corpus of Medieval and Early Modern inscriptions in Slovenia MEMIS 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1376.
CLARIN.SI project reports 2019
Following the successful introduction of this initiative in 2018, CLARIN.SI in 2019 again launched a call for project proposals for the members of its consortium. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Six project proposals were accepted with their descriptions and the resources they produced given below.
Tool for statistical analysis of dependency-parsed corpora
Applicants: Kaja Dobrovoljc (FF UL), Marko Robnik Šikonja (FRI UL)
Funds awarded: € 6,000
Within the project, we developed a computer program for statistical analysis of parsed corpora (the STARK tool) that produces frequency lists of trees from dependency-parsed corpora. The user defines the type of trees to be extracted through several parameters in the configuration file, such as the number of nodes in the tree and their type (from word forms to abstract grammatical categories), and the potential differentiation of trees based on their completeness, labelling and surface word order. In addition to this bottom-up approach to dependency tree extraction, which does not rely on any linguistic assumptions, the tool also enables tree extraction based on additional restrictions and queries with pre-defined tree structures. The results are output as a tabular text file with information on the tree structure and its nodes, the corpus frequency, and the strength of statistical association between nodes according to different association measures. The tool expects the standard CoNLL-U format as input, making it directly applicable not only to Slovenian corpora, such as the ssj500k treebank or the 1-billion-word Gigafida reference corpus, but also to more than 70 other languages for which the same type of data is already available.
The STARK command-line tool is publicly available under the Apache 2.0 licence at https://github.com/clarinsi/STARK, and can also be downloaded from the CLARIN.SI repository:
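To illustrate the idea of a frequency list of dependency trees, the following minimal Python sketch counts the simplest possible trees (one head and one dependent) in a small CoNLL-U sample. It is not the STARK implementation and does not use its configuration file; the sample sentences and the function names `read_sentences` and `count_two_node_trees` are assumptions made for this illustration only.

```python
from collections import Counter

# Tiny hand-made CoNLL-U sample (10 tab-separated columns per token line).
# The sentences are invented for this illustration.
SAMPLE = """\
# text = Ana bere knjigo.
1\tAna\tAna\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_
3\tknjigo\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# text = Tone bere časopis.
1\tTone\tTone\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_
3\tčasopis\tčasopis\tNOUN\t_\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_sentences(text):
    """Yield each sentence as a list of (id, upos, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        sentence.append((int(cols[0]), cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


def count_two_node_trees(text):
    """Count head-dependent pairs, described by UPOS tags and the relation."""
    counts = Counter()
    for sentence in read_sentences(text):
        upos_by_id = {tok_id: upos for tok_id, upos, _, _ in sentence}
        for _, upos, head, deprel in sentence:
            if head == 0:  # the root has no head inside the sentence
                continue
            counts[(upos_by_id[head], deprel, upos)] += 1
    return counts


tree_counts = count_two_node_trees(SAMPLE)
for (head_upos, deprel, dep_upos), freq in tree_counts.most_common():
    print(f"{head_upos} >{deprel} {dep_upos}\t{freq}")
```

Describing nodes by part-of-speech tag rather than by word form corresponds to choosing a more abstract node type; replacing `cols[3]` with `cols[1]` or `cols[2]` in the sketch would count trees over word forms or lemmas instead.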
- Krsnik, Luka; Dobrovoljc, Kaja and Robnik-Šikonja, Marko, 2019, Dependency tree extraction tool STARK 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1284.
Establishing access to historical versions of Slovenian language reference corpus Gigafida
Applicant: Andraž Repar, CJVT
Funds awarded: € 1,500
The CLARIN.SI concordancers noSketch Engine and KonText offered only the latest version of the Slovenian reference corpus, Gigafida v2.0. This corpus has been newly linguistically annotated and, unlike its previous versions, does not include non-standard texts.
For various reasons, it is sometimes necessary to access previous versions of the Gigafida corpus, e.g. to access the removed non-standard texts. This is especially relevant for research on Slovenian in neighbouring countries, because sources containing this variety of Slovene (e.g., the newsletter Novi Matajur) were removed as non-standard. Additionally, access to the older versions enables replicability of previous research performed on this corpus.
The project enabled the previous versions of the Gigafida corpus (FidaPLUS, Gigafida 1.0 and Gigafida 1.1) to be mounted on the CLARIN.SI noSketch Engine and KonText platforms. The plan was also to mount the first version of the corpus, the so-called FIDA, for which agreements were signed with the copyright holders, i.e. the companies Amebis, d.o.o. and DZS, d.d. Unfortunately, the project funds were only sufficient to cover the copyright transfer from DZS to the University of Ljubljana, with none left over for transferring the data from CD-ROMs into the digital form necessary to publish them on the two CLARIN.SI platforms.
CLARIN.SI noSketch Engine and KonText now mount the following versions of the Gigafida corpus:
- Gigafida v2.0 proto (not deduplicated): noSketch Engine, KonText,
- Gigafida v2.0 (deduplicated): noSketch Engine, KonText,
- Gigafida v1.1 (not deduplicated): noSketch Engine, KonText,
- Gigafida v1.1 dedup (deduplicated): noSketch Engine, KonText,
- Gigafida v1.0: noSketch Engine, KonText,
- FidaPLUS: noSketch Engine, KonText.
Corpus for Slovene coreference resolution and aspect-based sentiment analysis–SentiCoref 1.0
Applicant: Slavko Žitnik, FRI UL
Funds awarded: € 6,000
The aim of the project was to compile the SentiCoref 1.0 corpus which includes sentiment annotations for specific entities in the text. In addition to the sentiment level annotation, coreferences and named entities were also tagged. Named entities include person names, organization names and locations. Each named entity is annotated along with all the coreferent mentions that refer to an underlying entity. The corpus enables better coreference resolution analyses and aspect-based sentiment analysis for the Slovene language.
The SentiCoref 1.0 corpus contains texts from the SentiNews 1.0 corpus (Bučar, 2017), which consists of 10,427 documents. Each document in SentiNews 1.0 is annotated with a five-level sentiment at the document, paragraph and sentence level. SentiCoref 1.0 consists of 837 documents selected from SentiNews 1.0 based on the number of named entities (automatically tagged using the Polyglot tool): each selected document contains between 50 and 73 named entities.
SentiCoref 1.0 corpus consists of 31,419 named entities: 15,285 organization names, 8,606 person names and 7,528 locations. All the documents form 14,572 coreference chains (i.e., entities) with 438,733 entity mentions. Entities are annotated using the following sentiment levels: very negative: 30 entities; negative: 1,801 entities; neutral: 10,869 entities; positive: 1,705 entities; very positive: 24 entities.
The SentiCoref 1.0 corpus along with the annotation guidelines is available under the CC BY 4.0 licence:
- Žitnik, Slavko, 2019, Slovene corpus for aspect-based sentiment analysis – SentiCoref 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1285.
Speech corpus of dialogue acts GORDAN 1.0
Applicant: Darinka Verdonik, Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor
Funds awarded: € 6,000
During the project Speech corpus of dialogue acts GORDAN 1.0, we developed a dialogue act corpus for Slovene. The corpus contains a balanced sample of different types of spoken discourse with a total length of one hour. The data was selected from previously existing Slovene corpora (GOS, Gos VideoLectures and BERTA) according to four criteria: public/non-public, interactive/monologic, channel and intention.
Before selecting and defining the annotation scheme, four well-established schemes (MRDA, AMI, ISO 24617-2 and DART) were evaluated based on the following criteria: ensuring annotation of pragmatic meaning, coherent structure, general validity and well-balanced structure. Substantial drawbacks regarding these criteria were found in all of the existing schemes. Based on these findings, we defined the GORDAN 1.0 scheme, which keeps the advantages of the analysed schemes and overcomes their drawbacks.
The selected data has been annotated in accordance with the GORDAN 1.0 annotation scheme in the Transcriber 1.5.1 tool using its function Event. If the video recording was available, the annotators used multimodal data coupling the audio and video recording.
The data are available as two separate datasets:
- Zwitter Vitez, Ana; et al., 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (audio/video), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1292: includes original audio recordings (and video recordings if available) that are downloadable under the original licence, i.e., CC BY-NC-ND 4.0.
- Verdonik, Darinka, 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1291: includes transcriptions, dialogue act annotations and the GORDAN 1.0 annotation scheme specification that can be distributed under the CC BY 4.0 licence.
Slovene metaphor corpus Komet 1.0
Applicant: Špela Antloga, FERI UM
Funds awarded: € 4,000
The Komet corpus is a hand-annotated corpus of metaphorical expressions which covers 200,000 lexical units from Slovene journalistic, fiction and on-line texts. Metaphors are a complex phenomenon that can be rendered on the linguistic level by novel and creative expressions, or by strongly lexicalized units that are hardly noticeable as metaphorical. Recognising this complexity and the need for clearly defined guidelines for metaphor identification in texts, a group of English linguists developed a procedure for this task: the MIPVU protocol. Since the research on metaphors in Slovene has been very unsystematic and vague, an adapted and modified version of the procedure was used to create a Slovene corpus of metaphors. In this corpus, lexical units (words) whose contextual meaning differs from their basic meaning are considered metaphor-related words. The basic and contextual meaning of each word in the corpus was defined using the Dictionary of the standard Slovene Language. The corpus was annotated for the following relations to metaphor: indirect metaphor, direct metaphor, borderline cases and metaphor signals. In addition, the corpus is annotated with conceptual frames, which hold information about the concept to which an expression refers. These conceptual frames allow us to search for figurative expressions within a specific context category (e.g., time, spatial orientation, emotions, etc.). The Slovene metaphor corpus Komet enables objective and systematic analysis of metaphorical expressions and metaphors in various Slovene texts.
The Komet corpus is available under the CC BY-NC-SA 4.0 licence:
- Antloga, Špela, 2020, Metaphor corpus KOMET 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1293.
Placing new orthographical rules on the Fran portal
Applicant: ZRC SAZU, the Scientific Research Center of the Slovenian Academy of Sciences and Arts
Funds received: € 6,000
The project launched a public presentation of new orthographical rules and corresponding dictionary entries to supplement the rules on the Fran portal. The drafts of the first two chapters of the orthographical rules allow users to participate in a public debate on the adequacy of the proposed solutions and their content.
For this purpose, each chapter of the new spelling rules was converted from Word .docx text format to TEI XML, with a converter developed for this purpose. In this way, orthographical rules are available in accordance with international recommendations, thus facilitating their further development, maintenance, and distribution, as well as their connectivity and adaptability to different uses. The TEI-encoded rules are linked to the Slovenian Normative Guide Dictionary (ePravopis).
Namely, simultaneously with the revision of individual chapters of the orthographical rules at the Fran Ramovš Institute of the Slovenian Language, ZRC SAZU is creating a Slovenian Normative Guide Dictionary – ePravopis. Linking ePravopis to the rules is an important step because it combines the information from the dictionary and the rules. Users are thus offered insights that such guides have not allowed so far.
The first two chapters are available on the Fran portal, while the TEI files will be available in the CLARIN.SI repository under the CC BY-NC 4.0 licence once all the chapters have been prepared.
CLARIN.SI project reports 2018
In 2018, CLARIN.SI launched for the first time a call for project proposals for the members of its consortium. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Seven project proposals were accepted with their descriptions and the resources they produced given below.
Upgrade of the eZISS digital library of text-critical editions of Slovene literature
Applicants: Andrej Pančur, INZ, Matija Ogrin, ZRC SAZU
Funds awarded: € 4,000
The project has upgraded two very complex and extensive editions which include diverse components and realise a variety of text-critical concepts of analysing and displaying texts. In addition, it developed a significantly improved display of the electronic edition, its internal structure (transcriptions, digital facsimiles, notes, critical apparatus, the accompanying scientific commentary) as well as the links between the components. The existing XSLT transformations from the GitHub repository (https://github.com/SIstory/Stylesheets) have been adapted and, to ensure the dynamic display of the parallel sections, upgraded with an XSLT 3.0 transformation for Saxon-JS. The XSLT transformations are accessible in the “Profiles” folder at https://github.com/DARIAH-SI/Foglar-pub and https://github.com/DARIAH-SI/Kapelski-pub.
The project has also entailed editorial work on both editions:
- Kapelski pasijon (The Železna Kapla Passion Play): the tagging in the TEI markup language has been improved, the scientific commentary has been partly reorganised and all the transcriptions have been linked with the associated digital facsimile files and mutual references.
- Foglarjev rokopis (The Foglar Manuscript): a complete digital edition has been created with the diplomatic and critical transcription of the manuscript, including the apparatus of variants for the several handwritten versions of the poems under consideration. The edition has been prepared by Nina Ditmajer. Both transcriptions have been linked with digital facsimiles and their tagging in the TEI markup language has been adapted to the various possibilities of displaying and linking the texts.
An important motive and aspect of the upgrade is its usefulness for future digital editions of the eZISS library in the context of the DARIAH-SI research infrastructure. Namely, DARIAH-SI aims to establish a TEI-based digital library enabling the presentation of complex digital editions such as the Železna Kapla Passion Play or the Foglar Manuscript, and a connection to the corpus analysis services at CLARIN.SI.
The Železna Kapla Passion Play is accessible at:
The Foglar manuscript is accessible at:
The corpus of parliamentary minutes of the National Assembly of the Republic of Slovenia 1990-2018
Applicant: Andrej Pančur, INZ
Funds awarded: € 3,000
During the project, the siParl corpus was created. It contains all the parliamentary minutes of the National Assembly of the Republic of Slovenia between 1990 and 2018 (until the end of the 7th legislative period) as well as all the minutes of the National Assembly’s working bodies since 1996, all of which amounts to almost 230 million tokens in total. The parliamentary minutes from the 1990–1992 period have been obtained from the existing SlovParl 2.0 corpus, while the rest of the minutes have been newly tagged. The tagging has been completed in the TEI module for drama texts and converted into the TEI module for speech transcription. The corpus includes data about the speeches and the speakers, non-verbal content of the session minutes and relevant metadata. The content of the speeches has also been linguistically tagged, i.e. tokenised, morphosyntactically tagged and lemmatised.
The siParl corpus is available through the concordance software and for download under the CC BY licence:
- Pančur, Andrej; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca and Blaj Hribar, Neja, 2019, Slovenian parliamentary corpus siParl 1.0 (1990-2018), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1236.
Assigning stress to the Sloleks lexicon
Applicant: Špela Arhar Holdt, CJVT UL
Funds received: € 5,000
In the project, the latest version of the Sloleks morphological lexicon of Slovene was improved with the addition of automatically assigned accents, a portion of which were also manually evaluated. The interface of the lexicon was also upgraded to facilitate crowdsourcing of the newly added data. The project focused on the lemmas in which the position of the accent is fixed on the word stem. In the first step, accents were automatically assigned to all word forms in the lexicon. Through existing dictionary resources, 55% of the automatically assigned accents were confirmed with an estimated accuracy of 75%. 24% of the lexicon data was processed manually, the majority with the use of crowdsourcing. Counting both the results of automatic as well as manual approaches, the project corrected 21.7% of the automatically assigned accents. Future work will include proper nouns as well as lemmas with non-fixed accents and accent variants.
The project also upgraded the design of the user interface: (a) by implementing the graphic design developed for CJVT resources; and (b) by upgrading the interface with features that allow the community to participate in database clean-up (i.e. allowing them to upvote/downvote the assigned accents, the automatically generated phonetic transcriptions and the text-to-speech pronunciation). Additional functionalities are also being developed as part of other ongoing projects, such as the possibility for users to contribute recordings of their own pronunciation of words.
The database is available under the CC BY-NC-SA licence:
- Dobrovoljc, Kaja; et al., 2019, Morphological lexicon Sloleks 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1230.
The compilation of word and n-gram lists for various levels of education and for different subjects
Applicant: Iztok Kosem, Faculty of Arts, University of Ljubljana
Funds awarded: € 4,000
The project involved compiling a corpus of textbooks used in Slovenian elementary and secondary schools, and extracting word and n-gram lists and keywords. The collected textbooks were available in PDF and HTML formats and were converted into text format. Afterwards, the converted texts were examined, issues were corrected, and the texts were POS-tagged. The corpus contains around 5 million tokens from 127 textbooks covering 16 different subjects. The second step involved the extraction of word lists, etc., and several measures to ensure the quality of the data. In addition, the lists were manually analysed. The final results are the following lists:
- List of general words occurring in at least 8 out of 16 subjects. The list contains information on lemma, word form and frequency (by education level and the number of subjects).
- List of general words by education level (grade/year) containing information on lemma, word class, frequency and number of levels in which the lemma was found.
- List of 2-5-grams containing the word-forms of the n-gram, its lemmas, word classes and POS-tags, its frequency and the number of subjects in which the n-gram was found.
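The criterion behind the first list (a lemma counts as general if it occurs in enough different subjects) can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the toy `(subject, lemma)` pairs, the function name `general_lemmas` and the threshold of 2 out of 3 subjects are hypothetical, whereas the project used a threshold of 8 out of 16 subjects on the full textbook corpus.

```python
from collections import Counter, defaultdict


def general_lemmas(tokens, min_subjects):
    """Return (lemma, frequency, number of subjects) rows for lemmas
    that occur in at least `min_subjects` different subjects."""
    freq = Counter()
    subjects = defaultdict(set)
    for subject, lemma in tokens:
        freq[lemma] += 1
        subjects[lemma].add(subject)
    rows = [
        (lemma, freq[lemma], len(subjects[lemma]))
        for lemma in freq
        if len(subjects[lemma]) >= min_subjects
    ]
    # Most frequent lemmas first; ties are broken alphabetically.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Invented toy data: (subject, lemma) pairs from three subjects.
toy_tokens = [
    ("biology", "celica"), ("biology", "primer"), ("biology", "primer"),
    ("history", "vojna"), ("history", "primer"),
    ("physics", "sila"), ("physics", "primer"), ("physics", "celica"),
]

general = general_lemmas(toy_tokens, min_subjects=2)
for lemma, frequency, n_subjects in general:
    print(lemma, frequency, n_subjects)
```

Lemmas such as "vojna" and "sila" occur in a single subject only and are therefore left out of the general list.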
The lists are available under the CC BY 4.0 licence:
- Kosem, Iztok; Pori, Eva and Arhar Holdt, Špela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1215.
A Tool for efficient analysis of Slovene corpora
Applicants: Marko Robnik Šikonja, Špela Arhar Holdt, UL FRI
Funds awarded: € 4,000
In the project, a clear and comprehensible user interface for the corpusStatistics tool (renamed to LIST) was developed. The tool offers user-friendly access to language statistics in corpora of Slovene and other languages. The tool was adapted for several corpus formats and tested on large corpora of Slovene and other languages.
The program now includes metadata in all outputs, which enables the reproducibility of the results. The elements of the user interface contain short explanations shown on mouse-over. Several new association measures for word sequences are supported, e.g. Dice, t-score, MI, and MI3. The program now estimates the time needed to return the results and warns users about settings which may require a longer processing time. Users can now also switch between the Slovene and English interfaces and can process non-Latin scripts. The program was upgraded with support for the TEI P5 format used for recently published corpora in the CLARIN.SI repository, and the vertical format (VERT) used by Sketch Engine.
The LIST program is available under the Apache 2.0 open licence:
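For readers unfamiliar with the association measures named above, the following Python sketch computes them for a single word pair using their commonly cited textbook definitions. The exact formulas implemented in LIST may differ; the function name `association_measures` and the example frequencies are assumptions made for illustration.

```python
import math


def association_measures(f_xy, f_x, f_y, n):
    """Score a word pair (x, y) from corpus frequencies.

    f_xy: frequency of the pair, f_x and f_y: frequencies of the two
    words, n: corpus size in tokens. All values must be positive.
    """
    expected = f_x * f_y / n  # pair frequency expected under independence
    return {
        "dice": 2 * f_xy / (f_x + f_y),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
    }


# Invented example frequencies.
scores = association_measures(f_xy=8, f_x=16, f_y=32, n=1024)
for name, value in scores.items():
    print(f"{name}\t{value:.3f}")
```

MI3 cubes the observed pair frequency, which reduces the tendency of plain MI to favour very rare pairs.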
- Krsnik, Luka; et al., 2019, Corpus extraction tool LIST 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1227.
Gos Videolectures II
Applicants: Darinka Verdonik, Andrej Žgank, UM FERI
Funds awarded: € 6,000
The goal of the Gos VideoLectures II project was to enlarge the existing Gos VideoLectures database with an additional 8 hours of manual transcriptions of selected speeches from the Videolectures.net database. Transcriptions were done in a two-level transcription system, where the first level represents conversational transcription and the second level represents standardized transcription as defined in the Slovene GOS corpus. The speech signal was manually segmented into utterances/segments, and notable acoustic events were manually annotated. The second goal was to automatically segment the speech signal into words and into a restricted list of phonemes. This was done with an adapted version of the automatic speech recognition system for Slovene, UMB Broadcast News, developed at the Faculty of Electrical Engineering and Computer Science of the University of Maribor.
As in the previous versions of the Gos VideoLectures database, the transcriptions were converted from the Transcriber 1.5.1 XML format to TEI (module for speech corpora). The conversion to TEI includes the list of speakers with their metadata, metadata about the speeches, alignment of utterances and sentences with the speech signal, coding of acoustic events, and alignment between the conversational and standardized transcriptions. Additionally, words in the standardized transcription were automatically lemmatized and tagged with MSDs. Along with the conversion, the source files were validated, and a number of errors were detected and corrected. Based on the TEI files, we created a vertical file that is needed for the import of the database into the CLARIN.SI concordancers. Audio files were also adapted for import into the concordancers, so that it is now possible to listen to the recordings while searching through the Gos VideoLectures corpus.
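A vertical file of the kind mentioned above lists one token per line with tab-separated attributes, while structures such as documents and utterances are marked with XML-like tags. The Python sketch below writes such a file from in-memory data. The attribute order (word, lemma, MSD), the structure names `doc` and `u`, the identifiers and the function name `to_vertical` are assumptions for illustration and are not taken from the actual Gos VideoLectures conversion.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, utterances):
    """Serialise utterances into a vertical-format string.

    utterances: list of (speaker_id, tokens) pairs, where tokens is a
    list of (word, lemma, msd) triples.
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for speaker, tokens in utterances:
        lines.append(f"<u who={quoteattr(speaker)}>")
        for word, lemma, msd in tokens:
            # One token per line, attributes separated by tabs.
            lines.append("\t".join((word, lemma, msd)))
        lines.append("</u>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Invented example: one utterance by a hypothetical speaker "spk1";
# the MSD tags are illustrative.
vertical = to_vertical(
    "gvl-example",
    [("spk1", [("dober", "dober", "Agpmsnn"), ("dan", "dan", "Ncmsn")])],
)
print(vertical, end="")
```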
The corpus is available through CLARIN.SI concordancers and for download:
- Verdonik, Darinka et al., 2019, Spoken corpus Gos VideoLectures 4.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1223.
- VideoLectures.NET, 2019, Spoken corpus Gos VideoLectures 4.0 (audio), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1222.
The multimedia database of the dictionary of the clothing terminology of the Zilja local dialect of Canale Valley (Val Canale – Kanaltal – Valcjanâl)
Applicant: Carmen Kenda-Jež, ZRC SAZU
Funds awarded: € 4,000
The multimedia database for The Dictionary of the Clothing terminology of the Zilja Local dialect of Canale Valley, published on the FRAN portal, was created from the collection of dialect material previously used for two printed editions of the dictionary. The transfer to the digital environment resulted in a formal adaptation to the new medium (e.g. the manner of data presentation, replacement of the abbreviations with their full forms, or the unification of grammatical qualifiers) and in a range of microstructural changes that were caused by the self-contained presentation of the online dictionary entry and its direct links to the sound clip collection. The final version of the online dictionary is therefore substantially different from its printed versions.
The dictionary, which contains 594 entries, was transformed from the Word format to a dictionary database in XML and equipped with intra- and inter-dictionary links. The original collection of sound clips was revised. Sound clips of lower quality (e.g. those with overlapping speech) were eliminated. Where possible, new sound material was obtained through additional analysis of previously used recordings. The sound clips were linked with dialect lemmas and examples.
Selected photographic material from the ethnographic research archive of clothing culture has been added to the database. For some of the entries, a connection has been established with the ethnographic online collection Glasovi Kanalske doline (The voices of Canale Valley) of the Zborzbirk project Kulturna dediščina v zbirkah med Alpami in Krasom (Cultural heritage in the collections between the Alps and the Karst). The Fran portal also gives access to the monographs on the local Zilja dialect, Ovčja vas in njena slovenska govorica (Ovčja vas and its Slovenian Speech), 2005, and Lipalja vas in njena slovenska govorica (Lipalja vas and its Slovenian Speech), 2016, which were made openly accessible as part of the project.
The database is available at:
- Kenda-Jež, Karmen; Perdih, Andrej and Race, Duša, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1217.
- Gliha Komac, Nataša; Kandutsch, Elisa; Bartaloth, Rudi and Smole, Matevž, 2019, The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): photographs, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1221.
- Kenda-Jež, Karmen, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): audio, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1220.