Slovenian language resource repository CLARIN.SI
The CLARIN.SI digital repository system captures, stores, indexes, preserves, and distributes digital research material.
https://www.clarin.si:443/repository/xmlui
Updated: 2024-03-14T00:33:24Z

Slovenian Emotion Dimension and Emotion Association Lexicon SloEmoLex 1.0
Brglez, Mojca; Caporusso, Jaya; Hoogland, Damar; Koloski, Boshko; Pollak, Senja; Purver, Matthew
http://hdl.handle.net/11356/1875
Updated: 2024-03-13T15:29:11Z
Published: 2024-03-05T00:00:00Z
SloEmoLex is a lexicon of emotion, valence, arousal and dominance for 19,998 Slovenian entries.
It includes and extends the Slovenian part of the LiLaH lexicon (Ljubešić et al., 2020; http://hdl.handle.net/11356/1318), in which words are annotated with binary values for association with the 8 basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and binary values for association with positive/negative sentiment.
SloEmoLex extends the LiLaH emotion lexicon with VAD (valence, arousal, dominance) scores from NRC VAD v1 (http://saifmohammad.com/WebPages/nrc-vad.html) and emotion intensity scores from the NRC Emotion Intensity Lexicon v1 (http://saifmohammad.com/WebPages/AffectIntensity.htm). Apart from the approx. 14,000 words present in LiLaH, the lexicon includes 5,931 additional entries from the NRC VAD lexicon. Some of these were translated with the use of sloWNet 3.1 (http://hdl.handle.net/11356/1026), while others (3,273 entries) retain the machine translation provided in the Slovenian part of the NRC VAD lexicon.
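As an illustration of how a lexicon of this kind might be used, the sketch below averages valence, arousal and dominance over the tokens of a text that are found in the lexicon. The record layout, the two entries and their scores are invented for the example and are not taken from SloEmoLex; the actual file layout should be checked against the resource itself.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class Entry:
    # Hypothetical record layout; not the actual SloEmoLex columns.
    valence: float
    arousal: float
    dominance: float
    emotions: frozenset = field(default_factory=frozenset)


# Toy lexicon with made-up scores in [0, 1].
LEXICON = {
    "veselje": Entry(0.9, 0.7, 0.6, frozenset({"joy"})),
    "strah": Entry(0.1, 0.8, 0.2, frozenset({"fear"})),
}


def average_vad(tokens, lexicon=LEXICON):
    """Return mean (valence, arousal, dominance) over known tokens, or None."""
    hits = [lexicon[t] for t in (tok.lower() for tok in tokens) if t in lexicon]
    if not hits:
        return None
    return (
        mean(e.valence for e in hits),
        mean(e.arousal for e in hits),
        mean(e.dominance for e in hits),
    )
```

Tokens missing from the lexicon are simply ignored here; a real application would probably lemmatise the text first and decide how to treat uncovered words.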
If you use this work, please cite our paper:
Caporusso, Jaya, Hoogland, Damar, Brglez, Mojca, Koloski, Boshko, Purver, Matthew, and Pollak, Senja (to appear in 2024). A Computational Analysis of the Dehumanisation of Migrants from Syria and Ukraine in Slovene News Media. To be presented at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 20-25 May 2024, Torino, Italia.

Slovenian Semantic Lexicon sloWNet-USAS 1.0
Brglez, Mojca; Pahor de Maiti Tekavčič, Kristina
http://hdl.handle.net/11356/1925
Updated: 2024-03-12T16:04:33Z
Published: 2024-03-10T00:00:00Z
This entry is an extension of the Slovenian semantic lexicon sloWNet 3.1 (http://hdl.handle.net/11356/1026) which is enriched with semantic tags following the USAS ontology. The USAS ontology (Piao et al., 2005) is part of the UCREL semantic analysis system and is used for general language semantic description (https://ucrel.lancs.ac.uk/usas/). It consists of 21 major semantic fields (e.g., PHYSICAL ATTRIBUTES [O4]) and more than 400 semantic subcategories (e.g., Temperature [O4.6], Temperature : Cold [O4.6-]) that group together words belonging to the same mental concepts. The semantic tags were translated into Slovene and then automatically mapped onto the sloWNet entries from the USAS semantic lexicon following the algorithmic steps described in the README file. This procedure assigned semantic tags to 41,135 unique entries. The semantic tags were also given concreteness scores, calculated according to the procedure described in the README file.
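The hierarchical shape of the tags can be made concrete with a small parser. The sketch below splits a tag into its letter, numeric levels and trailing polarity marks, and lists its more general parents. The pattern is inferred only from the examples given above (O4, O4.6, O4.6-); the full tagset may contain forms it does not cover, and the function names are hypothetical.

```python
import re

# Pattern inferred from the examples in the description (O4, O4.6, O4.6-);
# the full USAS tagset may contain forms this sketch does not cover.
TAG_RE = re.compile(r"^(?P<field>[A-Z])(?P<levels>\d+(?:\.\d+)*)(?P<polarity>[+-]*)$")


def parse_usas_tag(tag):
    """Split a tag into its letter, numeric levels and polarity marks."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    levels = [int(n) for n in m.group("levels").split(".")]
    return m.group("field"), levels, m.group("polarity")


def ancestors(tag):
    """List the tag followed by its successively more general parents."""
    letter, levels, polarity = parse_usas_tag(tag)
    # A tag with polarity marks is more specific than the same tag without them.
    chain = [tag] if polarity else []
    for depth in range(len(levels), 0, -1):
        chain.append(letter + ".".join(str(n) for n in levels[:depth]))
    return chain
```

For instance, `ancestors("O4.6-")` walks from Temperature : Cold up through Temperature to O4.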
The file USAS_sl_conc.tsv contains the complete USAS tagset, including the concreteness scores of semantic domains, and their Slovenian descriptions. The file sloWNet_USAS_1.0.tsv contains, in tabular format, lexemes from sloWNet 3.1 paired with the semantic tag of their most literal, basic sense as identified by the algorithm, together with all the semantic tag candidates from which the closest tag was sourced. The resource was originally used to facilitate metaphor analysis, but can also be helpful for other tasks such as text classification and sentiment analysis.

Monitor corpus of Slovene Trendi 2024-02
Kosem, Iztok; Čibej, Jaka; Dobrovoljc, Kaja; Erjavec, Tomaž; Ljubešić, Nikola; Ponikvar, Primož; Šinkec, Mihael; Krek, Simon
http://hdl.handle.net/11356/1924
Updated: 2024-03-06T15:51:26Z
Published: 2024-03-06T00:00:00Z
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 70 publishers. Trendi 2024-02 covers the period from January 2019 to February 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the topic, or thematic category, automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem (iztok.kosem@ijs.si).
This version adds texts from February 2024.

Service for querying dependency treebanks Drevesnik 1.1
Štravs, Miha; Dobrovoljc, Kaja
http://hdl.handle.net/11356/1923
Updated: 2024-03-04T11:13:07Z
Published: 2024-03-01T00:00:00Z
Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service for querying Slovenian corpora parsed with the Universal Dependencies annotation scheme. It features an easy-to-use query language on the one hand and user-friendly graph visualizations on the other. It is based on the open-source dep_search tool (https://github.com/TurkuNLP/dep_search), which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results, and filtering by sentence length.
The source code and the documentation for the search backend and the web user interface are publicly available on the CLARIN.SI GitHub repository https://github.com/clarinsi/drevesnik. This submission corresponds to release 1.1: https://github.com/clarinsi/drevesnik/releases/tag/1.1, which brings improved architecture, documentation and branding in comparison to release 1.0.

The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (public)
Novak, Erik; Calcina, Erik; Mladenić, Dunja; Grobelnik, Marko
http://hdl.handle.net/11356/1922
Updated: 2024-02-27T12:38:45Z
Published: 2024-02-27T00:00:00Z
The OG2021 corpus contains multilingual news articles reporting on events that took place during the 2021 Tokyo Olympics. The data set was created to evaluate a news clustering algorithm. The articles were initially acquired via the EventRegistry service, clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator, who used translation services to understand the articles' content.
The corpus consists of a single file, og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- lang: The language in which the article is written. Can be one of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. The published dates range between 2021-07-01 and 2021-08-14.
- URL: The URL location of the news article.
- cluster_id: The ID of the cluster the article is a member of.
The dataset is also published with the body attribute but under a more restrictive licence. It can be found at http://hdl.handle.net/11356/1921.
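A file with these attributes can be grouped by cluster using the standard library alone. The sketch below reads a few invented rows that use the attribute names listed above; the comma delimiter, the presence of a header row, and the ID and timestamp formats are assumptions, not details confirmed by the description.

```python
import csv
import io
from collections import defaultdict

# A few invented rows using the attribute names listed above; real data
# would be read with open("og2021.csv", newline="", encoding="utf-8").
SAMPLE = """id,title,lang,source,published_at,URL,cluster_id
a1,Example title one,en,Example News,2021-07-24T10:00:00,https://example.org/1,c1
a2,Example title two,sl,Primer Novice,2021-07-24T11:30:00,https://example.org/2,c1
a3,Example title three,en,Example News,2021-07-25T09:15:00,https://example.org/3,c2
"""


def group_by_cluster(handle):
    """Map each cluster_id to the list of article rows that belong to it."""
    clusters = defaultdict(list)
    for row in csv.DictReader(handle):
        clusters[row["cluster_id"]].append(row)
    return dict(clusters)


clusters = group_by_cluster(io.StringIO(SAMPLE))
# Languages represented in each cluster, e.g. to find multilingual clusters.
languages = {cid: sorted({r["lang"] for r in rows}) for cid, rows in clusters.items()}
```

Passing an open file handle instead of the `io.StringIO` wrapper applies the same grouping to the real file, provided its layout matches the assumptions above.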

The news articles reporting on the 2021 Tokyo Olympics data set OG2021 (research)
Novak, Erik; Calcina, Erik; Mladenić, Dunja; Grobelnik, Marko
http://hdl.handle.net/11356/1921
Updated: 2024-02-27T12:37:58Z
Published: 2024-02-27T00:00:00Z
The OG2021 corpus contains multilingual news articles reporting on events that took place during the 2021 Tokyo Olympics. The data set was created to evaluate a news clustering algorithm. The articles were initially acquired via the EventRegistry service, clustered using an online news clustering algorithm, and finally manually inspected and annotated by a single evaluator, who used translation services to understand the articles' content.
The corpus consists of a single file, og2021.csv, which contains data on 10,940 news articles grouped into 1,350 clusters. Each article has the following attributes:
- id: The ID of the news article.
- title: The title of the article.
- body: The body of the article.
- lang: The language in which the article is written. Can be one of nine values.
- source: The news publisher's name.
- published_at: The date and time when the article was published. The published dates range between 2021-07-01 and 2021-08-14.
- URL: The URL location of the news article.
- cluster_id: The ID of the cluster the article is a member of.
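Since the data set is intended for evaluating clustering, the annotated cluster_id values can serve as a gold standard against which a system's output is compared. The description does not say which metric the authors used; the sketch below shows pairwise precision, recall and F1, one common choice, and its function name and input format are assumptions.

```python
from itertools import combinations


def pairwise_scores(gold, predicted):
    """Pairwise precision, recall and F1 between two clusterings.

    Both arguments map an article ID to a cluster label.
    """
    ids = sorted(gold)
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_gold = gold[a] == gold[b]
        same_pred = predicted[a] == predicted[b]
        if same_gold and same_pred:
            tp += 1  # pair correctly placed together
        elif same_pred:
            fp += 1  # pair wrongly placed together
        elif same_gold:
            fn += 1  # pair wrongly separated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The loop visits every pair of articles, so its cost grows quadratically with the number of articles; for a corpus of this size a count-based formulation over a contingency table would be faster.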

Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0
Ljubešić, Nikola; Rupnik, Peter; Koržinek, Danijel
http://hdl.handle.net/11356/1834
Updated: 2024-02-08T15:55:46Z
Published: 2024-02-08T00:00:00Z
The ParlaSpeech-RS dataset is built from the transcripts of parliamentary proceedings available in the Serbian part of the ParlaMint (ParlaMint-RS) corpus, and the parliamentary recordings available from the Serbian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
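The further segmentation that word-level alignments make possible can be sketched as a greedy split by duration. The `(text, start, end)` tuple layout and the function name below are hypothetical and do not reflect the dataset's actual file format, which should be checked in the resource's documentation.

```python
def split_by_duration(words, max_seconds):
    """Greedily group aligned words into chunks no longer than max_seconds.

    Each word is a (text, start, end) tuple with times in seconds.
    A single word longer than the limit becomes its own chunk.
    """
    chunks, current = [], []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word[2] - current[0][1] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return [
        {"text": " ".join(w[0] for w in c), "start": c[0][1], "end": c[-1][2]}
        for c in chunks
    ]
```

Each returned chunk carries its own start and end time, which is what would be needed to cut the corresponding audio.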

Thesaurus of Modern Slovene 2.0
Krek, Simon; Laskowski, Cyprian; Robnik-Šikonja, Marko; Kosem, Iztok; Arhar Holdt, Špela; Gantar, Polona; Čibej, Jaka; Gorjanc, Vojko; Klemenc, Bojan; Dobrovoljc, Kaja
http://hdl.handle.net/11356/1916
Updated: 2024-02-06T11:37:19Z
Published: 2023-11-15T00:00:00Z
Thesaurus of Modern Slovene is the largest automatically generated open-access collection of Slovene synonyms. It is sourced from the data in two principal language resources: The Oxford®-DZS Comprehensive English-Slovenian Dictionary and the Gigafida 1.0 corpus of written Slovene. The links identified between synonyms were additionally confirmed using the Dictionary of Standard Slovenian Language (SSKJ). The data extraction and structure for the Thesaurus were based on the frequency and manner in which words co-occur in translation strings of the Oxford-DZS Dictionary. This information is the basis for discriminating between ‘core’ and ‘near’ synonyms, with ‘core’ synonyms exhibiting a greater connection to the keyword. In the following step, an approach combining balanced co-occurrence graphs and the Personal PageRank algorithm automatically divides the synonyms into subgroups and ranks them according to the degree of semantic relatedness to the keyword, as well as their frequency in language use. For the creation methodology, see Krek et al. (2017) in the provided references.
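The ranking idea can be illustrated with a small power-iteration version of PageRank in which all restart probability returns to the keyword (often called personalized PageRank). This is a generic sketch, not the authors' implementation: it omits the balancing of the co-occurrence graph and the division into subgroups, and the damping factor, iteration count and toy weights are assumptions.

```python
def personalized_pagerank(graph, seed, damping=0.85, iterations=50):
    """Power iteration for PageRank with all restart mass on one seed node.

    graph maps each node to a dict of neighbour -> positive edge weight;
    every neighbour must itself be a key of graph.
    """
    nodes = list(graph)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, neighbours in graph.items():
            total = sum(neighbours.values())
            if total == 0:
                # Dangling node: return its mass to the seed.
                nxt[seed] += damping * rank[node]
                continue
            for other, weight in neighbours.items():
                nxt[other] += damping * rank[node] * weight / total
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank


# Toy co-occurrence graph around one keyword; the weights are invented.
GRAPH = {
    "hiter": {"brz": 3.0, "uren": 1.0},
    "brz": {"hiter": 3.0},
    "uren": {"hiter": 1.0},
}
ranks = personalized_pagerank(GRAPH, seed="hiter")
```

In this toy graph the candidate with the heavier link to the keyword ends up with the higher score, which is the behaviour a relatedness ranking relies on.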
The database includes dictionary entries: single- and multiword headwords, their part-of-speech and other linguistic features, as well as automatically extracted synonyms, their type (core or near) and relevancy rank. In version 2.0, 4,544 manually revised antonyms were added to the database. Additionally, for a part of the database, synonyms were distributed under the corresponding word senses. Depending on how much lexicographic revision was involved in their preparation, database entries have one of the following three statuses: (a) ssss-automatic (96,064 entries): no manual revision was conducted; (b) ssss-manual (3,421 entries): word senses and semantic indicators were prepared by lexicographers, and synonyms were manually distributed under each corresponding sense; (c) ssss-hybrid (1,352 entries): manually revised senses are combined with data compiled automatically. For novelties of v2.0, see Arhar Holdt et al. (2023) in the provided references.

Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0
Koržinek, Danijel; Ljubešić, Nikola
http://hdl.handle.net/11356/1686
Updated: 2024-02-02T10:51:10Z
Published: 2024-02-01T00:00:00Z
The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus, and the parliamentary recordings available from the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.

Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0
Ljubešić, Nikola; Koržinek, Danijel; Rupnik, Peter
http://hdl.handle.net/11356/1914
Updated: 2024-02-07T19:00:25Z
Published: 2024-01-25T00:00:00Z
The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus, and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
The main differences from version 1.0 of the dataset are:
- larger size (ParlaMint 4.0 is used here, while previously ParlaMint 2.1 was used)
- improved matching pipeline
- segments based on linguistically sound sentences from the ParlaMint transcripts, while previously segments surrounded with silence were used