CLARIN.SI Projects

CLARIN.SI Project Reports 2018

Introduction

In 2018, CLARIN.SI for the first time launched a call for project proposals for the members of its consortium. The call, with a budget of €30,000, targeted projects that built or upgraded resources or services contributing to the advancement of the CLARIN.SI mission. Seven project proposals were accepted; their descriptions and the resources they produced are given below.

Upgrade of the eZISS digital library of text-critical editions of Slovene literature

Applicants: Andrej Pančur, INZ, Matija Ogrin, ZRC SAZU

Funds received: € 4,000

The project has upgraded two very complex and extensive editions, which include diverse components and realise a variety of text-critical concepts for analysing and displaying texts. In addition, it has developed a significantly improved display of the electronic edition, of its internal structure (transcriptions, digital facsimiles, notes, critical apparatus, the accompanying scientific commentary) and of the links between the components. The existing XSLT transformations from the GitHub repository (https://github.com/SIstory/Stylesheets) have been adapted and, to ensure dynamic display of the parallel sections, extended with XSLT 3.0 transformations for Saxon-JS. The XSLT transformations are accessible in the “Profiles” folder at https://github.com/DARIAH-SI/Foglar-pub and https://github.com/DARIAH-SI/Kapelski-pub.

The project has also entailed editorial work on both editions:

  • Kapelski pasijon (The Železna Kapla Passion Play): the tagging in the TEI markup language has been improved, the scientific commentary has been partly reorganised and all the transcriptions have been linked with the associated digital facsimile files and mutual references.
  • Foglarjev rokopis (The Foglar Manuscript): a complete digital edition has been created with the diplomatic and critical transcription of the manuscript, including the apparatus of variants for the several handwritten versions of the poems under consideration. The edition has been prepared by Nina Ditmajer. Both transcriptions have been linked with digital facsimiles and their tagging in the TEI markup language has been adapted to the various possibilities of displaying and linking the texts.

An important motivation for the upgrade is its usefulness for future digital editions of the eZISS library within the DARIAH-SI research infrastructure. DARIAH-SI aims to establish a TEI-based digital library that enables the presentation of complex digital editions, such as the Železna Kapla Passion Play or the Foglar Manuscript, and provides a connection to the corpus analysis services at CLARIN.SI.

The Železna Kapla Passion Play is accessible at:

The Foglar Manuscript is accessible at:

The corpus of parliamentary minutes of the National Assembly of the Republic of Slovenia 1990-2018

Applicant: Andrej Pančur, INZ

Funds received: € 3,000

During the project, the siParl corpus was created. It contains all the parliamentary minutes of the National Assembly of the Republic of Slovenia between 1990 and 2018 (until the end of the 7th legislative period) as well as all the minutes of the National Assembly’s working bodies since 1996, amounting to almost 230 million tokens in total. The parliamentary minutes from the 1990–1992 period were taken from the existing SlovParl 2.0 corpus, while the rest of the minutes were newly tagged. The tagging was carried out in the TEI module for drama texts and then converted into the TEI module for speech transcription. The corpus includes data about the speeches and the speakers, the non-verbal content of the session minutes and relevant metadata. The content of the speeches has also been linguistically annotated, i.e. tokenised, morphosyntactically tagged and lemmatised.

The siParl corpus is available through the concordance software and for download under the CC BY licence at:

  • Pančur, Andrej; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca and Blaj Hribar, Neja, 2019, Slovenian parliamentary corpus siParl 1.0 (1990-2018), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1236.

Assigning stress to the Sloleks lexicon

Applicant: Špela Arhar Holdt, CJVT UL

Funds received: € 5,000

In the project, the latest version of the Sloleks morphological lexicon of Slovene was improved with the addition of automatically assigned accents, a portion of which were also manually evaluated. The interface of the lexicon was also upgraded to facilitate crowdsourcing of the newly added data. The project focused on the lemmas in which the position of the accent is fixed on the word stem. In the first step, accents were automatically assigned to all word forms in the lexicon. Through existing dictionary resources, 55% of the automatically assigned accents were confirmed with an estimated accuracy of 75%. A further 24% of the lexicon data was processed manually, mostly through crowdsourcing. Taking the automatic and manual approaches together, the project corrected 21.7% of the automatically assigned accents. Future work will include proper nouns as well as lemmas with non-fixed accents and accent variants.

The project also upgraded the design of the user interface: (a) by implementing the graphic design developed for CJVT resources; and (b) by upgrading the interface with features that allow the community to participate in database clean-up (i.e. allowing them to upvote/downvote the assigned accents, the automatically generated phonetic transcriptions and the text-to-speech pronunciation). Additional functionalities are also being developed as part of other ongoing projects, such as the possibility for users to contribute recordings of their own pronunciation of words.

The database is available under the CC BY-NC-SA licence at:

The compilation of word and n-gram lists for various levels of education and for different subjects

Applicant: Iztok Kosem, Faculty of Arts, University of Ljubljana

Funds received: € 4,000

The project involved compiling a corpus of textbooks used in Slovenian elementary and secondary schools, and extracting word lists, n-gram lists and keywords from it. The collected textbooks were available in PDF and HTML formats and were converted into plain text. The converted texts were then checked, errors were corrected and the texts were POS-tagged. The corpus contains around 5 million tokens from 127 textbooks covering 16 different subjects. The second step involved the extraction of the word lists and other lists, together with several measures to ensure the quality of the data. In addition, the lists were manually analysed. The final results are the following lists:

  • List of general words occurring in at least 8 out of 16 subjects. The list contains information on lemma, word form and frequency (by education level and the number of subjects).
  • List of general words by education level (grade/year) containing information on lemma, word class, frequency and number of levels in which the lemma was found.
  • List of 2-5-grams containing the word-forms of the n-gram, its lemmas, word classes and POS-tags, its frequency and the number of subjects in which the n-gram was found.
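The selection criterion behind the first list (words occurring in at least 8 of the 16 subjects) can be sketched as follows. This is only an illustration: the function name, the input shape (pairs of subject and lemma) and the sample data are assumptions made for the example, not the project's actual pipeline or data.

```python
from collections import defaultdict


def general_words(tokens, min_subjects=8):
    """Return {lemma: (frequency, number_of_subjects)} for lemmas that
    occur in at least `min_subjects` different subjects.

    `tokens` is an iterable of (subject, lemma) pairs. This flattened
    view of a tagged corpus is a hypothetical input format.
    """
    frequency = defaultdict(int)
    subjects = defaultdict(set)
    for subject, lemma in tokens:
        frequency[lemma] += 1
        subjects[lemma].add(subject)
    return {
        lemma: (frequency[lemma], len(subjects[lemma]))
        for lemma in frequency
        if len(subjects[lemma]) >= min_subjects
    }


if __name__ == "__main__":
    # Toy data with a lowered threshold, for illustration only.
    sample = [("maths", "biti"), ("history", "biti"), ("maths", "kot")]
    print(general_words(sample, min_subjects=2))
```

With the default threshold of 8, a lemma is kept only if it was seen in at least half of the 16 subjects; the per-level lists could be built the same way by grouping on education level instead of subject.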

The lists are available under the CC BY 4.0 licence at:

  • Kosem, Iztok; Pori, Eva and Arhar Holdt, Špela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1215.

A tool for efficient analysis of Slovene corpora

Applicants: Marko Robnik Šikonja, Špela Arhar Holdt, UL FRI

Funds received: € 4,000

In the project, a clear and comprehensible user interface for the corpusStatistics tool (renamed to LIST) was developed. The tool offers user-friendly access to language statistics in corpora of Slovene and other languages. The tool was adapted for several corpus formats and tested on large corpora of Slovene and other languages.

The program now includes metadata in all outputs, which makes the results reproducible. The elements of the user interface contain short explanations shown on mouse-over. Several new association measures for word sequences are supported, e.g. Dice, t-score, MI and MI3. The program now estimates the time needed to return the results and warns users about settings which may require a longer processing time. Users can now also switch between the Slovene and English interfaces and can process non-Latin scripts. The program was upgraded with support for the TEI P5 format, used for recently published corpora in the CLARIN.SI repository, and the vertical format (VERT) used by SketchEngine.
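For two-word sequences, the association measures named above have widely used textbook definitions, which the following sketch computes from raw frequencies. The function and parameter names are hypothetical, and the formulas are the common definitions; whether LIST uses exactly these variants, or how it generalises them to longer sequences, is not stated here.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Common association measures for a two-word sequence.

    f_xy: frequency of the sequence (must be > 0)
    f_x, f_y: frequencies of the first and second word
    n: corpus size in tokens
    """
    # Frequency expected if the two words were independent.
    expected = f_x * f_y / n
    return {
        "dice": 2 * f_xy / (f_x + f_y),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
    }


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    for name, value in association_scores(8, 16, 32, 1024).items():
        print(f"{name}: {value:.4f}")
```

MI tends to favour rare sequences, which is why MI3 (cubing the observed frequency) and the t-score are often reported alongside it.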

The LIST program is available under the Apache 2.0 open licence:

Gos VideoLectures II

Applicants: Darinka Verdonik, Andrej Žgank, UM FERI

Funds received: € 6,000

The goal of the Gos VideoLectures II project was to enlarge the existing Gos VideoLectures database with an additional 8 hours of manual transcriptions of selected speeches from the Videolectures.net database. The transcriptions follow a two-level transcription system, in which the first level is a conversational transcription and the second level a standardized transcription, as defined in the Slovene GOS corpus. The speech signal was manually segmented into utterances/segments, and notable acoustic events were manually annotated. The second goal was to automatically segment the speech signal into words and into a restricted set of phonemes. This was done with an adapted version of the UMB Broadcast News automatic speech recognition system for Slovene, developed at the Faculty of Electrical Engineering and Computer Science of the University of Maribor.

As in the previous versions of the Gos VideoLectures database, the transcriptions were converted from the Transcriber 1.5.1 XML format to TEI (module for speech corpora). The conversion to TEI covers the list of speakers with their metadata, metadata about the speeches, the alignment of utterances and sentences with the speech signal, the coding of acoustic events and the alignment between the conversational and standardized transcriptions. Additionally, the words in the standardized transcription were automatically lemmatized and tagged with morphosyntactic descriptions (MSDs). Along with the conversion, the source files were validated, and a number of errors were detected and corrected. Based on the TEI files, we created a vertical file, which is needed to import the database into the CLARIN.SI concordancers. The audio files were also adapted for import into the concordancers, so that it is now possible to listen to the recordings while searching through the Gos VideoLectures corpus.
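The step from TEI to a vertical file can be illustrated with a minimal sketch. It assumes, purely for the example, that utterances are encoded as TEI u elements with a who attribute and that words are w elements carrying lemma and ana (MSD) attributes; the actual Gos VideoLectures encoding and the project's conversion scripts may differ. The vertical format itself has one token per line with tab-separated attributes, and structures written as XML-like tags on their own lines.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's "{uri}" notation.
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_to_vertical(tei_xml):
    """Convert a TEI string to vertical text.

    Assumed (hypothetical) input: <u who="..."> utterances containing
    <w lemma="..." ana="..."> word elements.
    """
    root = ET.fromstring(tei_xml)
    lines = []
    for utterance in root.iter(TEI + "u"):
        lines.append('<u who="{}">'.format(utterance.get("who", "")))
        for word in utterance.iter(TEI + "w"):
            # One token per line: word form, lemma, MSD tag.
            columns = [word.text or "", word.get("lemma", ""), word.get("ana", "")]
            lines.append("\t".join(columns))
        lines.append("</u>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative input, not taken from the corpus.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        '<u who="#spk1"><w lemma="dober" ana="#Agpmsnn">dober</w>'
        '<w lemma="dan" ana="#Ncmsn">dan</w></u>'
        "</body></text></TEI>"
    )
    print(tei_to_vertical(sample))
```

A real conversion would also need to carry over the alignment with the audio signal and both transcription levels, which this sketch leaves out.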

The corpus is available through CLARIN.SI concordancers and for download at:

The multimedia database of the dictionary of the clothing terminology of the Zilja local dialect of Canale Valley (Val Canale – Kanaltal – Valcjanâl)

Applicant: Karmen Kenda-Jež, ZRC SAZU

Funds received: € 4,000

The multimedia database for The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley, published on the Fran portal, was created from the collection of dialect material previously used for two printed editions of the dictionary. The transfer to the digital environment led to formal adaptations to the new medium (e.g. the manner of data presentation, the replacement of abbreviations with their full forms and the unification of grammatical qualifiers) and to a range of microstructural changes required by the self-contained presentation of each online dictionary entry and its direct links to the sound clip collection. The final version of the online dictionary is therefore substantially different from its printed versions.

The dictionary, which contains 594 entries, was converted from Word format into an XML dictionary database and equipped with intra- and inter-dictionary links. The original collection of sound clips was revised: clips of lower quality (e.g. those with overlapping speech) were eliminated and, where possible, new sound material was obtained through additional analysis of the previously used recordings. The sound clips were linked with the dialect lemmas and examples.

Selected photographic material from the ethnographic research archive of clothing culture has been added to the database. For some of the entries, a connection has been established with the ethnographic online collection Glasovi Kanalske doline (The voices of Canale Valley) of the Zborzbirk project Kulturna dediščina v zbirkah med Alpami in Krasom (Cultural heritage in the collections between the Alps and the Karst). As part of the project, the Fran portal also provides open access to the monographs on the local Zilja dialect: Ovčja vas in njena slovenska govorica (Ovčja vas and its Slovenian Speech, 2005) and Lipalja vas in njena slovenska govorica (Lipalja vas and its Slovenian Speech, 2016).

The database is available at:

  • Kenda-Jež, Karmen; Perdih, Andrej and Race, Duša, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1217.
  • Gliha Komac, Nataša; Kandutsch, Elisa; Bartaloth, Rudi and Smole, Matevž, 2019, The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): photographs, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1221.
  • Kenda-Jež, Karmen, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): audio, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1220.