CLARIN.SI ELEXIS

CLARIN.SI ELEXIS http://hdl.handle.net/11356/1479 Dictionaries and lexicons of the European Lexicographic Infrastructure 2026-07-08T12:59:15Z Parallel sense-annotated corpus ELEXIS-WSD 2.0 http://hdl.handle.net/11356/2101 Parallel sense-annotated corpus ELEXIS-WSD 2.0 Čibej, Jaka; Krek, Simon; Tiberius, Carole; Martelli, Federico; Navigli, Roberto; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; Simon, László; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Munda, Tina; Kosem, Iztok; Roblek, Rebeka; Kamenšek, Urška; Zaranšek, Petra; Zgaga, Karolina; Ponikvar, Primož; Terčon, Luka; Jensen, Jonas; Flörke, Ida; Lorentzen, Henrik; Troelsgård, Thomas; Blagoeva, Diana; Hristov, Dimitar; Kolkovska, Sia; Muischnek, Kadri; Saul, Kertu; Jõgi, Karoliina; Bon, Mija; Stanković, Ranka; Krstev, Cvetana; Marković, Aleksandra; Ikonić Nešić, Milica; Giouli, Voula; Papanikolaou, Eri; Lobzhanidze, Irina; Barbu Mititelu, Verginica; Popa, Simina; Cristiana, Lea; Catalin, Mihaila; Irimia, Elena; Ostroški Anić, Ana; Runjaić, Siniša; Sviben, Robert; Pavić, Martina; Filipović Petrović, Ivana; Alberski, Bartłomiej; Cvetkoski, Vladimir; Kanishcheva, Olha; Makhachashvili, Rusudan ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 2.0 contains subcorpora with sentences for 17 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, Slovene, Serbian, Croatian, Macedonian, Greek, Romanian, Georgian, and Polish. In addition, it contains manually corrected translations for Ukrainian - these will be processed in future versions. 
In 2.0, not all corpora cover all annotation layers - a more detailed overview is available in 00README.txt. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe (https://lindat.mff.cuni.cz/services/udpipe/ - see 00README.txt for information on specific models). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2 and manually validated for Slovene, Georgian, Romanian, and Estonian. 
List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene SR: Serbian WordNet The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt. Updates in version 2.0: - Subcorpora for 7 new languages (Serbian, Croatian, Macedonian, Greek, Romanian, Polish, Georgian) and translations for Ukrainian were added. - Sense annotations for ELEXIS-WSD-sl were updated. 
Additional multiword expression annotations were added according to the PARSEME 2.0 guidelines (see 00README.txt). 2026-04-01T00:00:00Z Sample from the e-book "Kitakov študent" (Kitak's student) http://hdl.handle.net/11356/2127 Sample from the e-book "Kitakov študent" (Kitak's student) Hartman Krajnc, Jana This entry includes the first part of the e-book "Kitakov študent" (Kitak's student) by author Jana Hartman Krajnc (COBISS.SI-ID 275100931, ISBN 978-961-7272-47-5). Ivan Kitak was a friend of General Maister. But he died too soon for anyone except his family to know about him today. He died at a time when General Maister was being silenced. Now no one remembers Ivan Kitak as Maister's close associate. The book "Kitakov študent" (Kitak's student) is a literary historical novel. 2026-04-25T00:00:00Z Sample from the audiobook "CAMINO – Poklon Junakom 3. nadstropja" (CAMINO – Gift to the Heroes of the 3rd Floor) http://hdl.handle.net/11356/2121 Sample from the audiobook "CAMINO – Poklon Junakom 3. nadstropja" (CAMINO – Gift to the Heroes of the 3rd Floor) Krepek, Anton Submission includes the first part of the audiobook "CAMINO – Poklon Junakom 3. nadstropja" (CAMINO – Gift to the Heroes of the 3rd Floor) by author Anton Krepek (COBISS.ID: 275243779, ISBN: 978-961-291-536-0). The book is a description of the Camino pilgrimage route, which the author Toni undertook as an act of gratitude after their child recovered from cancer and became one of the “Heroes of the 3rd Floor.” The book is a tribute to all of them. Despite the difficult ordeal, Toni manages to make readers laugh, find the good in things, and truly enjoy every single day of life. That’s why he infused the book with lightness and humor. 
2026-04-20T00:00:00Z Parallel sense-annotated corpus ELEXIS-WSD 1.2 http://hdl.handle.net/11356/2022 Parallel sense-annotated corpus ELEXIS-WSD 1.2 Čibej, Jaka; Krek, Simon; Tiberius, Carole; Martelli, Federico; Navigli, Roberto; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; Simon, László; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Munda, Tina; Kosem, Iztok; Roblek, Rebeka; Kamenšek, Urška; Zaranšek, Petra; Zgaga, Karolina; Ponikvar, Primož; Terčon, Luka; Jensen, Jonas; Flörke, Ida; Lorentzen, Henrik; Troelsgård, Thomas; Blagoeva, Diana; Hristov, Dimitar; Kolkovska, Sia ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.2 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. 
Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2. List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl). 
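The column layout described above can be read with a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes the standard CoNLL-U convention that the MISC column holds "Key=Value" items separated by "|". The class name Token, the function names, and the key "SenseID" in the sample line are hypothetical; the actual key names used for sense, multiword expression, and named entity annotations are documented in 00README.txt.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    """One CoNLL-U token line, with the MISC column split into a dict."""
    token_id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: Dict[str, str]


def parse_misc(field: str) -> Dict[str, str]:
    """Split a MISC field such as "SpaceAfter=No|Key=Value" into a dict."""
    if field == "_":  # "_" marks an empty column in CoNLL-U
        return {}
    pairs = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_token_line(line: str) -> Token:
    """Parse one tab-separated token line into a Token."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return Token(*columns[:9], parse_misc(columns[9]))


# Hypothetical sample line: "SenseID" and its value are placeholders,
# not the key or ID format actually used in the corpus.
sample = "1\tbank\tbank\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No|SenseID=hypothetical-id"
token = parse_token_line(sample)
print(token.lemma, token.upos, token.misc)
```

Keeping MISC as a dictionary means annotation layers that are missing in a given subcorpus simply show up as absent keys.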
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt. Updates in version 1.2: - Several tokenization errors with multiword tokens were fixed in all subcorpora (e.g. the order of subtokens was incorrect in many cases; the issue has now been resolved). - XPOS, FEATS, HEAD, and DEPREL columns were added automatically with UDPipe (except for elexis-wsd-sl and elexis-wsd-et; for Slovene, all columns were manually validated; for Estonian, HEAD and DEPREL were manually validated; all other languages contain automatic tags in these columns – for more information on the models used and their performance, see 00README.txt). - The entry now includes lists of potential errors in automatically assigned XPOS and FEATS values. In previous versions, only UPOS tags were manually annotated, while the XPOS and FEATS columns were left empty. XPOS and FEATS have now been added automatically through UDPipe. The list of potential errors contains the list of lines in the corpus in which the XPOS and FEATS columns are potentially incorrect because the manually validated UPOS tag differs from the automatically assigned UPOS tag, which indicates that the automatically assigned XPOS and FEATS columns are probably incorrect. This is meant as a reference for future validation efforts. - For Slovene, named entity annotations were added based on the annotations from the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959). 
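The idea behind the list of potential errors (flagging lines where the manually validated UPOS tag differs from the automatically assigned one) can be sketched as follows. This is an illustration only, not the script used to produce the published lists: it assumes two line-aligned CoNLL-U versions of the same subcorpus with identical tokenization, and the function name find_upos_mismatches is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_upos_mismatches(
    manual_lines: Iterable[str], auto_lines: Iterable[str]
) -> List[Tuple[int, str, str]]:
    """Return (line number, manual UPOS, automatic UPOS) for each mismatch.

    Assumes both inputs are line-aligned, i.e. line N in one file
    describes the same token as line N in the other.
    """
    mismatches = []
    for line_no, (manual, auto) in enumerate(zip(manual_lines, auto_lines), start=1):
        # Skip sentence separators and comment lines.
        if not manual.strip() or manual.startswith("#"):
            continue
        manual_cols = manual.rstrip("\n").split("\t")
        auto_cols = auto.rstrip("\n").split("\t")
        if len(manual_cols) != 10 or len(auto_cols) != 10:
            continue  # not a well-formed token line
        # Column 4 (index 3) is the UPOS tag.
        if manual_cols[3] != auto_cols[3]:
            mismatches.append((line_no, manual_cols[3], auto_cols[3]))
    return mismatches


# Hypothetical two-token example in which the tagger disagrees on "bank".
manual = [
    "# text = They bank",
    "1\tThey\tthey\tPRON\t_\t_\t_\t_\t_\t_",
    "2\tbank\tbank\tVERB\t_\t_\t_\t_\t_\t_",
    "",
]
auto = [
    "# text = They bank",
    "1\tThey\tthey\tPRON\t_\t_\t_\t_\t_\t_",
    "2\tbank\tbank\tNOUN\t_\t_\t_\t_\t_\t_",
    "",
]
print(find_upos_mismatches(manual, auto))
```

A flagged line does not prove that XPOS or FEATS are wrong; it only marks a place where the automatic analysis started from a different part of speech and is therefore worth checking first.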
2025-04-04T00:00:00Z Parallel sense-annotated corpus ELEXIS-WSD 1.1 http://hdl.handle.net/11356/1842 Parallel sense-annotated corpus ELEXIS-WSD 1.1 Martelli, Federico; Navigli, Roberto; Krek, Simon; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; Simon, László; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tiberius, Carole; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Čibej, Jaka; Munda, Tina; Kosem, Iztok; Roblek, Rebeka; Kamenšek, Urška; Zaranšek, Petra; Zgaga, Karolina; Ponikvar, Primož; Terčon, Luka; Jensen, Jonas; Flörke, Ida; Lorentzen, Henrik; Troelsgård, Thomas; Blagoeva, Diana; Hristov, Dimitar; Kolkovska, Sia ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.1 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. 
Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus is comprised of 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. List of sense inventories BG: Dictionary of Bulgarian DA: DanNet – The Danish WordNet EN: Open English WordNet ES: Spanish Wiktionary ET: The EKI Combined Dictionary of Estonian HU: The Explanatory Dictionary of the Hungarian Language IT: PSC + Italian WordNet NL: Open Dutch WordNet PT: Portuguese Academy Dictionary (DACL) SL: Digital Dictionary Database of Slovene The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, five empty columns (reserved for e.g. dependency parsing, which is absent from this version), and the final MISC column containing the following: the token's whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. 
Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt. Differences from version 1.0: - Several minor errors were fixed (e.g. a typo in one of the Slovene sense IDs). - The corpus was converted to the true CoNLL-U format (as opposed to the CoNLL-U-like format used in v1.0). - An error was fixed that resulted in missing UPOS tags in version 1.0. - The sentences in all corpora now follow the same order (from 1 to 2024). 2023-05-22T00:00:00Z Serbian-English Terminology in the Power Engineering Domain - SrpEngPE (ELEXIS) http://hdl.handle.net/11356/1676 Serbian-English Terminology in the Power Engineering Domain - SrpEngPE (ELEXIS) Stanković, Ranka; Krstev, Cvetana; Ivanović, Tanja SrpEngPE - Serbian-English dictionary with terminology in the power engineering domain, automatically extracted from a domain parallel corpus (http://jerteh.rs/biblisha/ListaDokumenata.aspx?JCID=13&lng=en); the extracted monolingual term candidates and translation pairs were manually evaluated and post-edited. 2022-07-26T00:00:00Z Bilingual List of German-Serbian Translated Pairs of Lexical Units - SrpNemLex (ELEXIS) http://hdl.handle.net/11356/1677 Bilingual List of German-Serbian Translated Pairs of Lexical Units - SrpNemLex (ELEXIS) Stanković, Ranka; Krstev, Cvetana; Andonovski, Jelena SrpNemLex - Bilingual list of German-Serbian translated pairs of lexical units, automatically extracted from a parallel corpus that contains 14 novels (http://jerteh.rs/biblisha/ListaDokumenata.aspx?JCID=11&lng=en); the extracted monolingual candidates and translation pairs were manually evaluated and post-edited. 
2022-07-26T00:00:00Z Basque Lexical Data in Wikidata (ELEXIS) http://hdl.handle.net/11356/1675 Basque Lexical Data in Wikidata (ELEXIS) Contributors of Wikidata This dataset contains Basque lemma-sense pairs with a POS tag, and definitions, extracted from Wikidata using this query: https://w.wiki/qWH . Other RDF statements related to Basque lexemes can be retrieved, such as links from lexeme sense to Wikidata concept. 2020-12-14T00:00:00Z Dictionary of Kosovo-Metohija Dialect (ELEXIS) http://hdl.handle.net/11356/1669 Dictionary of Kosovo-Metohija Dialect (ELEXIS) Elezović, Gliša Речник косовско-метохиског дијалекта (Dictionary of the Dialect of Kosovo and Metohija) by Gliša Elezović was originally published in two volumes in 1932 and 1935 in Srpski dijalektološki zbornik (Serbian Dialectological Journal) based on the sources collected between 1902 and 1928. 2022-07-01T00:00:00Z Dictionary of Southern Serbian Dialects (ELEXIS) http://hdl.handle.net/11356/1670 Dictionary of Southern Serbian Dialects (ELEXIS) Zlatanović, Momčilo Речник говора јужне Србије (Dictionary of Southern Serbian Dialects) is a dialectological dictionary by Momčilo Zlatanović, which was first published in 1998. 2022-07-01T00:00:00Z