CLARIN.SI PROJECT REPORTS 2019
Following the successful introduction of this initiative in 2018, CLARIN.SI again launched a call for project proposals for the members of its consortium in 2019. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Six project proposals were accepted; their descriptions and the resources they produced are given below.
Tool for statistical analysis of dependency-parsed corpora
Applicants: Kaja Dobrovoljc (FF UL), Marko Robnik Šikonja (FRI UL)
Funds received: € 6,000
Within the project, we developed a computer program for statistical analysis of parsed corpora (the STARK tool) that produces frequency lists of trees from dependency-parsed corpora. The user defines the type of trees to be extracted through several parameters in the configuration file, such as the number of nodes in the tree and their type (from word forms to abstract grammatical categories), and whether trees should be differentiated by their completeness, labelling and surface word order. In addition to this bottom-up approach to dependency tree extraction, which does not rely on any linguistic assumptions, the tool also enables tree extraction based on additional restrictions and queries with pre-defined tree structures. The results are written to a tabular text file with information on the tree structure and its nodes, the corpus frequency, and the strength of statistical association between nodes according to different association measures. The tool expects the standard CoNLL-U format as input, making it directly applicable not only to Slovenian corpora, such as the ssj500k treebank or the 1-billion-word Gigafida reference corpus, but also to more than 70 other languages for which the same type of data is already available.
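To illustrate the bottom-up idea on the smallest possible case, the following Python sketch counts two-node trees (a head and one dependent) in CoNLL-U input. It is not STARK's implementation or interface: the function name `count_head_dependent_pairs` and the choice to represent nodes by their UPOS tags are assumptions made for this example only.

```python
from collections import Counter


def count_head_dependent_pairs(conllu_text):
    """Count (head UPOS, relation, dependent UPOS) triples in CoNLL-U text."""
    counts = Counter()
    # Sentences in CoNLL-U are separated by blank lines.
    for block in conllu_text.strip().split("\n\n"):
        tokens = {}
        for line in block.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip sentence-level comment lines
            cols = line.split("\t")
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            # Skip multiword-token ranges (1-2), empty nodes (1.1), unset heads.
            if not cols[0].isdigit() or not cols[6].isdigit():
                continue
            tokens[int(cols[0])] = (cols[3], int(cols[6]), cols[7])
        for upos, head, deprel in tokens.values():
            if head == 0 or head not in tokens:
                continue  # the root has no head inside the sentence
            counts[(tokens[head][0], deprel, upos)] += 1
    return counts


# Hand-made toy sentence, not taken from any of the corpora mentioned above.
sample = (
    "# text = Pes laja.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlaja\tlajati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

for tree, freq in count_head_dependent_pairs(sample).most_common():
    print(tree, freq)
```

Larger trees, other node types (word forms, lemmas) and association scores would need considerably more machinery than this sketch shows.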
The STARK command-line tool is publicly available under the Apache 2.0 license at https://gitea.cjvt.si/lkrsnik/STARK, and can also be downloaded from the CLARIN.SI repository:
- Krsnik, Luka; Dobrovoljc, Kaja and Robnik-Šikonja, Marko, 2019, Dependency tree extraction tool STARK 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1284.
Establishing access to historical versions of Slovenian language reference corpus Gigafida
Applicant: Andraž Repar, CJVT
Funds received: € 1,500
Before the project, the CLARIN.SI concordancers noSketch Engine and KonText offered only the latest version of the Slovenian reference corpus, Gigafida v2.0. This corpus has been newly linguistically annotated and, unlike its previous versions, does not include non-standard texts.
For various reasons, it is sometimes necessary to access previous versions of the Gigafida corpus, e.g. to access the removed non-standard texts. This is especially relevant for research on Slovenian in neighbouring countries, because sources containing this variety of Slovene (e.g. the newsletter Novi Matajur) were removed due to their non-standardness. Additionally, access to the older versions enables replication of previous research performed on this corpus.
The project enabled the previous versions of the Gigafida corpus (FidaPLUS, Gigafida 1.0 and Gigafida 1.1) to be mounted on the CLARIN.SI noSketch Engine and KonText platforms. The plan was also to mount the first version of the corpus, the so-called FIDA, for which agreements were signed with the copyright holders, i.e. the companies Amebis, d.o.o. and DZS, d.d. Unfortunately, the project funds were only sufficient to cover the copyright transfer from DZS to the University of Ljubljana, with none left over to enable the transfer of data from CD-ROMs to the digital form necessary to publish them on the two CLARIN.SI platforms.
CLARIN.SI noSketch Engine and KonText now mount the following versions of the Gigafida corpus:
- Gigafida v2.0 proto (not deduplicated): noSketch Engine, KonText,
- Gigafida v2.0 (deduplicated): noSketch Engine, KonText,
- Gigafida v1.1 (not deduplicated): noSketch Engine, KonText,
- Gigafida v1.1 dedup (deduplicated): noSketch Engine, KonText,
- Gigafida v1.0: noSketch Engine, KonText,
- FidaPLUS: noSketch Engine, KonText.
Corpus for Slovene coreference resolution and aspect-based sentiment analysis – SentiCoref 1.0
Applicant: Slavko Žitnik, FRI UL
Funds received: € 6,000
The aim of the project was to compile the SentiCoref 1.0 corpus which includes sentiment annotations for specific entities in the text. In addition to the sentiment level annotation, coreferences and named entities were also tagged. Named entities include person names, organization names and locations. Each named entity is annotated along with all the coreferent mentions that refer to an underlying entity. The corpus enables better coreference resolution analyses and aspect-based sentiment analysis for the Slovene language.
The SentiCoref 1.0 corpus contains texts from the SentiNews 1.0 corpus (Bučar, 2017), which consists of 10,427 documents. Each document in SentiNews 1.0 is annotated with a five-level sentiment at the document, paragraph and sentence level. SentiCoref 1.0 consists of 837 documents selected from SentiNews 1.0 based on the number of named entities (automatically tagged using the Polyglot tool): the selected documents contain between 50 and 73 named entities.
The SentiCoref 1.0 corpus contains 31,419 named entities: 15,285 organization names, 8,606 person names and 7,528 locations. Together, the documents form 14,572 coreference chains (i.e., entities) with 438,733 entity mentions. Entities are annotated using the following sentiment levels: very negative: 30 entities; negative: 1,801 entities; neutral: 10,869 entities; positive: 1,705 entities; very positive: 24 entities.
The SentiCoref 1.0 corpus along with the annotation guidelines is available under the CC BY 4.0 licence at:
- Žitnik, Slavko, 2019, Slovene corpus for aspect-based sentiment analysis – SentiCoref 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1285.
Speech corpus of dialogue acts GORDAN 1.0
Applicant: Darinka Verdonik, FERI UM
Funds received: € 6,000
During the project Speech corpus of dialogue acts GORDAN 1.0, we developed a dialogue act corpus for Slovene. The corpus contains a balanced sample of different types of spoken discourse with a total length of one hour. The data was selected from previously existing Slovene corpora (GOS, Gos VideoLectures and BERTA) according to four criteria: public/non-public, interactive/monologic, channel and intention.
Before selecting and defining the annotation scheme, four well-established schemes (MRDA, AMI, ISO 24617-2 and DART) were evaluated based on the following criteria: ensuring annotation of pragmatic meaning, coherent structure, general validity and well-balanced structure. Substantial drawbacks regarding these criteria were found in all of the existing schemes. Based on these findings, we defined the GORDAN 1.0 scheme, which keeps the advantages of the analysed schemes and overcomes their drawbacks.
The selected data was annotated in accordance with the GORDAN 1.0 annotation scheme in the Transcriber 1.5.1 tool using its Event function. Where a video recording was available, the annotators worked with multimodal data combining the audio and video recordings.
The data are available as two separate datasets:
- Zwitter Vitez, Ana; et al., 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (audio/video), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1292: includes original audio recordings (and video recordings if available) that are downloadable under the original licence, i.e., CC BY-NC-ND 4.0.
- Verdonik, Darinka, 2020, Dialogue act annotated spoken corpus GORDAN 1.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1291: includes transcriptions, dialogue act annotations and the GORDAN 1.0 annotation scheme specification that can be distributed under the CC BY 4.0 licence.
Slovene metaphor corpus Komet 1.0
Applicant: Špela Antloga, FERI UM
Funds received: € 4,000
The Komet corpus is a hand-annotated corpus of metaphorical expressions which covers 200,000 lexical units from Slovene journalistic, fiction and on-line texts. Metaphors are a complex phenomenon which can be rendered on the linguistic level by novel and creative expressions, or by strongly lexicalized units that are hardly noticeable as metaphorical. Recognising the complexity of the metaphor phenomenon and the need for clearly defined guidelines for metaphor identification in texts, a group of English linguists developed a procedure for metaphor identification in text: the MIPVU protocol. Since research on metaphors in Slovene has so far been unsystematic and vague, an adapted and modified version of this procedure was used to create a Slovene corpus of metaphors. In this corpus, lexical units (words) whose contextual meaning differs from their basic meaning are considered metaphor-related words. The basic and contextual meaning of each word in the corpus was determined using the Dictionary of the Standard Slovene Language. The corpus was annotated for the following relations to metaphor: indirect metaphor, direct metaphor, borderline cases and metaphor signals. In addition, the corpus is annotated with conceptual frames, which hold information about the concept to which an expression refers. These conceptual frames make it possible to search for figurative expressions within a specific context category (e.g., time, spatial orientation, emotions, etc.). The Slovene metaphor corpus Komet enables objective and systematic analysis of metaphorical expressions and metaphors in various Slovene texts.
The Komet corpus is available under the CC BY-NC-SA 4.0 licence at:
- Antloga, Špela, 2020, Metaphor corpus KOMET 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1293.
Placing new orthographical rules on the Fran portal
Applicant: ZRC SAZU, the Scientific Research Center of the Slovenian Academy of Sciences and Arts
Funds received: € 6,000
The project launched a public presentation of new orthographical rules and corresponding dictionary entries to supplement the rules on the Fran portal. The drafts of the first two chapters of the orthographical rules allow users to participate in a public debate on the adequacy of the proposed solutions and their content.
For this purpose, each chapter of the new spelling rules was converted from the Word .docx format to TEI XML with a converter developed specifically for the project. In this way, the orthographical rules are available in accordance with international recommendations, which facilitates their further development, maintenance and distribution, as well as their connectivity and adaptability to different uses. The TEI-encoded rules are linked to the Slovenian Normative Guide Dictionary (ePravopis).
In parallel with the revision of individual chapters of the orthographical rules, the Fran Ramovš Institute of the Slovenian Language, ZRC SAZU, is creating the Slovenian Normative Guide Dictionary, ePravopis. Linking ePravopis to the rules is an important step, because it combines the information from the dictionary and the rules. Users are thus offered insights that such guides have not allowed so far.
The first two chapters are available on the Fran portal, while the TEI files will be available on the CLARIN.SI repository under the CC BY-NC 4.0 license once all the chapters have been prepared.
CLARIN.SI PROJECT REPORTS 2018
In 2018, CLARIN.SI launched a call for project proposals for the members of its consortium for the first time. The call, with a budget of €30,000, targeted projects that either built or upgraded resources or services that contribute to the advancement of the CLARIN.SI mission. Seven project proposals were accepted; their descriptions and the resources they produced are given below.
Upgrade of the eZISS digital library of text-critical editions of Slovene literature
Applicants: Andrej Pančur, INZ, Matija Ogrin, ZRC SAZU
Funds received: € 4,000
The project upgraded two very complex and extensive editions which include diverse components and realise a variety of text-critical concepts of analysing and displaying texts. In addition, it developed a significantly improved display of the electronic edition, its internal structure (transcriptions, digital facsimiles, notes, critical apparatus, the accompanying scientific commentary) and the links between the components. The existing XSLT transformations from the GitHub repository (https://github.com/SIstory/Stylesheets) have been adapted and, to ensure dynamic display of the parallel sections, extended with an XSLT 3.0 transformation for Saxon-JS. The XSLT transformations are accessible in the “Profiles” folder at https://github.com/DARIAH-SI/Foglar-pub and https://github.com/DARIAH-SI/Kapelski-pub.
The project has also entailed editorial work on both editions:
- Kapelski pasijon (The Železna Kapla Passion Play): the tagging in the TEI markup language has been improved, the scientific commentary has been partly reorganised and all the transcriptions have been linked with the associated digital facsimile files and mutual references.
- Foglarjev rokopis (The Foglar Manuscript): a complete digital edition has been created with the diplomatic and critical transcription of the manuscript, including the apparatus of variants for the several handwritten versions of the poems under consideration. The edition has been prepared by Nina Ditmajer. Both transcriptions have been linked with digital facsimiles and their tagging in the TEI markup language has been adapted to the various possibilities of displaying and linking the texts.
An important motivation for the upgrade is its usefulness for future digital editions of the eZISS library in the context of the DARIAH-SI research infrastructure. DARIAH-SI aims to establish a TEI-based digital library enabling the presentation of complex digital editions such as the Železna Kapla Passion Play or the Foglar Manuscript, and a connection to the corpus analysis services at CLARIN.SI.
The Železna Kapla Passion Play is accessible at:
The Foglar Manuscript is accessible at:
The corpus of parliamentary minutes of the National Assembly of the Republic of Slovenia 1990-2018
Applicant: Andrej Pančur, INZ
Funds received: € 3,000
During the project, the siParl corpus was created. It contains all the parliamentary minutes of the National Assembly of the Republic of Slovenia between 1990 and 2018 (until the end of the 7th legislative period) as well as all the minutes of the National Assembly’s working bodies since 1996, which amounts to almost 230 million tokens in total. The parliamentary minutes from the 1990–1992 period were obtained from the existing SlovParl 2.0 corpus, while the rest of the minutes were newly tagged. The tagging was done in the TEI module for drama texts and then converted into the TEI module for speech transcription. The corpus includes data about the speeches and the speakers, non-verbal content of the session minutes and relevant metadata. The content of the speeches has also been linguistically tagged, i.e. tokenised, morphosyntactically tagged and lemmatised.
The siParl corpus is available through the concordance software and for download under the CC BY licence at:
- Pančur, Andrej; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca and Blaj Hribar, Neja, 2019, Slovenian parliamentary corpus siParl 1.0 (1990-2018), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1236.
Assigning stress to the Sloleks lexicon
Applicant: Špela Arhar Holdt, CJVT UL
Funds received: € 5,000
In the project, the latest version of the Sloleks morphological lexicon of Slovene was improved with the addition of automatically assigned accents, a portion of which were also manually evaluated. The interface of the lexicon was also upgraded to facilitate crowdsourcing of the newly added data. The project focused on the lemmas in which the position of the accent is fixed on the word stem. In the first step, accents were automatically assigned to all word forms in the lexicon. Through existing dictionary resources, 55% of the automatically assigned accents were confirmed with an estimated accuracy of 75%. 24% of the lexicon data was processed manually, the majority with the use of crowdsourcing. Counting both the results of automatic as well as manual approaches, the project corrected 21.7% of the automatically assigned accents. Future work will include proper nouns as well as lemmas with non-fixed accents and accent variants.
The project also upgraded the design of the user interface: (a) by implementing the graphic design developed for CJVT resources; and (b) by upgrading the interface with features that allow the community to participate in database clean-up (i.e. allowing them to upvote/downvote the assigned accents, the automatically generated phonetic transcriptions and the text-to-speech pronunciation). Additional functionalities are also being developed as part of other ongoing projects, such as the possibility for users to contribute recordings of their own pronunciation of words.
The database is available under the CC BY-NC-SA licence at:
- Dobrovoljc, Kaja; et al., 2019, Morphological lexicon Sloleks 2.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1230.
The compilation of word and n-gram lists for various levels of education and for different subjects
Applicant: Iztok Kosem, Faculty of Arts, University of Ljubljana
Funds received: € 4,000
The project involved compiling a corpus of textbooks used in Slovenian elementary and secondary schools, and extracting word lists, n-gram lists and keywords. The collected textbooks were available in PDF and HTML formats and were converted into plain text. Afterwards, the converted texts were examined, issues were corrected, and the texts were POS-tagged. The corpus contains around 5 million tokens from 127 textbooks covering 16 different subjects. The second step involved the extraction of word lists, etc., and several measures to ensure the quality of the data. In addition, the lists were manually analysed. The final results are the following lists:
- List of general words occurring in at least 8 out of 16 subjects. The list contains information on lemma, word form and frequency (by education level and the number of subjects).
- List of general words by education level (grade/year) containing information on lemma, word class, frequency and number of levels in which the lemma was found.
- List of 2-5-grams containing the word-forms of the n-gram, its lemmas, word classes and POS-tags, its frequency and the number of subjects in which the n-gram was found.
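The selection criteria behind these lists can be sketched in a few lines of Python. The snippet below is only an illustration under assumed inputs (a mapping from subject name to a list of lemmas); the function names `general_lemmas` and `ngrams` and the toy data are hypothetical and do not reproduce the procedure actually used in the project.

```python
from collections import Counter


def general_lemmas(lemmas_by_subject, min_subjects):
    """Return {lemma: (total frequency, number of subjects)} for lemmas
    attested in at least `min_subjects` subjects."""
    freq = Counter()
    subject_count = Counter()
    for lemmas in lemmas_by_subject.values():
        freq.update(lemmas)
        subject_count.update(set(lemmas))  # each subject counts once per lemma
    return {
        lemma: (freq[lemma], subject_count[lemma])
        for lemma in freq
        if subject_count[lemma] >= min_subjects
    }


def ngrams(tokens, n_min=2, n_max=5):
    """Count all contiguous n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy data with three subjects; the project's threshold was 8 of 16 subjects.
toy = {
    "biology": ["celica", "biti", "voda"],
    "chemistry": ["voda", "biti", "atom"],
    "history": ["vojna", "biti"],
}
print(general_lemmas(toy, min_subjects=2))
print(ngrams(["a", "b", "c"], n_min=2, n_max=3))
```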
The lists are available under the CC BY 4.0 licence at:
- Kosem, Iztok; Pori, Eva and Arhar Holdt, Špela, 2019, Keywords and n-grams from a textbook corpus, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1215.
A tool for efficient analysis of Slovene corpora
Applicants: Marko Robnik Šikonja, Špela Arhar Holdt, UL FRI
Funds received: € 4,000
In the project, a clear and comprehensible user interface for the corpusStatistics tool (renamed to LIST) was developed. The tool offers user-friendly access to language statistics in corpora of Slovene and other languages. The tool was adapted for several corpus formats and tested on large corpora of Slovene and other languages.
The program now includes metadata in all outputs, which makes the results reproducible. The elements of the user interface contain short explanations shown on mouse-over. Several new association measures for word sequences are supported, e.g. Dice, t-score, MI, and MI3. The program now estimates the time needed to return the results and warns users about settings which may require a longer processing time. Users can now also switch between the Slovene and English interfaces and can process non-Latin scripts. The program was upgraded with support for the TEI P5 format used for recently published corpora in the CLARIN.SI repository, and for the vertical format (VERT) used by Sketch Engine.
The LIST program is available under the Apache 2.0 open licence:
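For orientation, the association measures named above are commonly defined for a word pair as in the following Python sketch. These are the widely used textbook formulations; whether LIST uses exactly these variants is an assumption, and the function name `association_scores` is hypothetical.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Common association measures for a word pair.

    f_xy: frequency of the pair; f_x, f_y: frequencies of the two words;
    n: corpus size in tokens. All frequencies must be positive.
    """
    expected = f_x * f_y / n  # pair frequency expected under independence
    return {
        "dice": 2 * f_xy / (f_x + f_y),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
    }


# Made-up frequencies, for illustration only.
scores = association_scores(f_xy=20, f_x=100, f_y=50, n=10_000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

MI tends to favour rare pairs, which is why frequency-weighted variants such as MI3 and measures such as the t-score are often reported alongside it.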
- Krsnik, Luka; et al., 2019, Corpus extraction tool LIST 1.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1227.
Gos VideoLectures II
Applicants: Darinka Verdonik, Andrej Žgank, UM FERI
Funds received: € 6,000
The goal of the Gos VideoLectures II project was to enlarge the existing Gos VideoLectures database with an additional 8 hours of manual transcriptions of selected speeches from the Videolectures.net database. Transcriptions were made in a two-level transcription system, where the first level represents conversational transcription and the second level represents standardized transcription as defined in the Slovene GOS corpus. The speech signal was manually segmented into utterances/segments, and notable acoustic events were manually annotated. The second goal was to automatically segment the speech signal into words and into a restricted list of phonemes. This was done with an adapted version of the automatic speech recognition system for Slovene, UMB Broadcast News, developed at the Faculty of Electrical Engineering and Computer Science of the University of Maribor.
As in the previous versions of the Gos VideoLectures database, the transcriptions were converted from the Transcriber 1.5.1 XML format to TEI (module for speech corpora). The TEI encoding includes the list of speakers with their metadata, metadata about the speeches, alignment of utterances and sentences with the speech signal, coding of acoustic events, and alignment between the conversational and standardized transcriptions. Additionally, words in the standardized transcription were automatically lemmatized and tagged with MSDs. Along with the conversion, the source files were validated, and a number of errors were detected and corrected. Based on the TEI files, we created a vertical file that is needed for the import of the database into the CLARIN.SI concordancers. Audio files were also adapted for the import into the concordancers, so that it is now possible to listen to the recordings while searching through the Gos VideoLectures corpus.
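A vertical file of the kind used by Sketch Engine-style concordancers puts one token per line, with tab-separated attributes, and marks structures with XML-like tags. The Python sketch below shows the general shape only; the attribute order (word, lemma, MSD), the structure names `doc` and `s`, the function name `to_vertical` and the sample tags are assumptions for illustration and not the actual layout of the Gos VideoLectures vertical file.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Serialise sentences of (word, lemma, msd) triples to vertical format."""
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, lemma, msd in sentence:
            # One token per line; attributes are separated by tabs.
            lines.append("\t".join((word, lemma, msd)))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Illustrative tokens and tags only.
example = [[("Dober", "dober", "Agpmsnn"), ("dan", "dan", "Ncmsn")]]
print(to_vertical("lecture-01", example), end="")
```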
The corpus is available through CLARIN.SI concordancers and for download at:
- Videolectures.NET, 2019, Spoken corpus Gos VideoLectures 4.0 (audio), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1222.
- Verdonik, Darinka; et al., 2019, Spoken corpus Gos VideoLectures 4.0 (transcription), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1223.
The multimedia database of the dictionary of the clothing terminology of the Zilja local dialect of Canale Valley (Val Canale – Kanaltal – Valcjanâl)
Applicant: Karmen Kenda-Jež, ZRC SAZU
Funds received: € 4,000
The multimedia database for The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley, published on the Fran portal, was created from the collection of dialect material previously used for two printed editions of the dictionary. The transfer to the digital environment resulted in a formal adaptation to the new medium (e.g. the manner of data presentation, replacement of abbreviations with their full forms, and unification of grammatical qualifiers) and in a range of microstructural changes required by the self-contained presentation of the online dictionary entry and its direct links to the sound clip collection. The final version of the online dictionary is therefore substantially different from its printed versions.
The dictionary, which contains 594 entries, was converted from the Word format to a dictionary database in XML and equipped with intra- and inter-dictionary links. The original collection of sound clips was revised. Sound clips of lower quality (e.g. those with overlapping speech) were eliminated. Where possible, new sound material was gathered through additional analysis of previously used recordings. The sound clips were linked with dialect lemmas and examples.
Selected photographic material from the ethnographic research archive of clothing culture has been added to the database. Some of the entries have been connected with the ethnographic online collection Glasovi Kanalske doline (The Voices of Canale Valley) of the Zborzbirk project Kulturna dediščina v zbirkah med Alpami in Krasom (Cultural heritage in the collections between the Alps and the Karst). The Fran portal also gives access to the monographs on the local Zilja dialect, Ovčja vas in njena slovenska govorica (Ovčja vas and its Slovenian Speech, 2005) and Lipalja vas in njena slovenska govorica (Lipalja vas and its Slovenian Speech, 2016), which were made openly accessible as part of the project.
The database is available at:
- Kenda-Jež, Karmen; Perdih, Andrej and Race, Duša, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl), Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1217.
- Gliha Komac, Nataša; Kandutsch, Elisa; Bartaloth, Rudi and Smole, Matevž, 2019, The Dictionary of the Clothing Terminology of the Zilja Local Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): photographs, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1221.
- Kenda-Jež, Karmen, 2019, The Dictionary of the Clothing Terminology of the Zilja Dialect of Canale Valley (Kanalska dolina – Val Canale – Kanaltal – Valcjanâl): audio, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1220.