
 
dc.contributor.author Čibej, Jaka
dc.contributor.author Krek, Simon
dc.contributor.author Tiberius, Carole
dc.contributor.author Martelli, Federico
dc.contributor.author Navigli, Roberto
dc.contributor.author Kallas, Jelena
dc.contributor.author Gantar, Polona
dc.contributor.author Koeva, Svetla
dc.contributor.author Nimb, Sanni
dc.contributor.author Sandford Pedersen, Bolette
dc.contributor.author Olsen, Sussi
dc.contributor.author Langemets, Margit
dc.contributor.author Koppel, Kristina
dc.contributor.author Üksik, Tiiu
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Ureña-Ruiz, Rafael
dc.contributor.author Sancho-Sánchez, José-Luis
dc.contributor.author Lipp, Veronika
dc.contributor.author Váradi, Tamás
dc.contributor.author Győrffy, András
dc.contributor.author Simon, László
dc.contributor.author Quochi, Valeria
dc.contributor.author Monachini, Monica
dc.contributor.author Frontini, Francesca
dc.contributor.author Tempelaars, Rob
dc.contributor.author Costa, Rute
dc.contributor.author Salgado, Ana
dc.contributor.author Munda, Tina
dc.contributor.author Kosem, Iztok
dc.contributor.author Roblek, Rebeka
dc.contributor.author Kamenšek, Urška
dc.contributor.author Zaranšek, Petra
dc.contributor.author Zgaga, Karolina
dc.contributor.author Ponikvar, Primož
dc.contributor.author Terčon, Luka
dc.contributor.author Jensen, Jonas
dc.contributor.author Flörke, Ida
dc.contributor.author Lorentzen, Henrik
dc.contributor.author Troelsgård, Thomas
dc.contributor.author Blagoeva, Diana
dc.contributor.author Hristov, Dimitar
dc.contributor.author Kolkovska, Sia
dc.contributor.author Muischnek, Kadri
dc.contributor.author Saul, Kertu
dc.contributor.author Jõgi, Karoliina
dc.contributor.author Bon, Mija
dc.contributor.author Stanković, Ranka
dc.contributor.author Krstev, Cvetana
dc.contributor.author Marković, Aleksandra
dc.contributor.author Ikonić Nešić, Milica
dc.contributor.author Giouli, Voula
dc.contributor.author Papanikolaou, Eri
dc.contributor.author Lobzhanidze, Irina
dc.contributor.author Barbu Mititelu, Verginica
dc.contributor.author Popa, Simina
dc.contributor.author Cristiana, Lea
dc.contributor.author Catalin, Mihaila
dc.contributor.author Irimia, Elena
dc.contributor.author Ostroški Anić, Ana
dc.contributor.author Runjaić, Siniša
dc.contributor.author Sviben, Robert
dc.contributor.author Pavić, Martina
dc.contributor.author Filipović Petrović, Ivana
dc.contributor.author Alberski, Bartłomiej
dc.contributor.author Cvetkoski, Vladimir
dc.contributor.author Kanishcheva, Olha
dc.contributor.author Makhachashvili, Rusudan
dc.date.accessioned 2026-05-14T15:32:07Z
dc.date.available 2026-05-14T15:32:07Z
dc.date.issued 2026-04-01
dc.identifier.uri http://hdl.handle.net/11356/2101
dc.description ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 2.0 contains subcorpora with sentences for 17 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, Slovene, Serbian, Croatian, Macedonian, Greek, Romanian, Georgian, and Polish. In addition, it contains manually corrected translations for Ukrainian; these will be processed in future versions. In 2.0, not all corpora cover all annotation layers; a more detailed overview is available in 00README.txt. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe (https://lindat.mff.cuni.cz/services/udpipe/; see 00README.txt for information on specific models).
Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2 and manually validated for Slovene, Georgian, Romanian, and Estonian.

List of sense inventories
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene
SR: Serbian WordNet

The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus.
Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.

Updates in version 2.0:
- Subcorpora for 7 new languages (Serbian, Croatian, Macedonian, Greek, Romanian, Polish, Georgian) and translations for Ukrainian were added.
- Sense annotations for ELEXIS-WSD-sl were updated. Additional multiword expression annotations were added according to the PARSEME 2.0 guidelines (see 00README.txt).
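
The ten-column layout described above can be read with a few lines of Python. The following is a minimal sketch, not the corpus authors' tooling: it assumes the standard CoNLL-U conventions (tab-separated fields, `#` comment lines, blank lines between sentences, `|`-separated `Key=Value` pairs in MISC). The record only confirms `SpaceAfter=No` as a MISC key; the `SenseID` key and the sample values below are hypothetical placeholders, and the actual key names are documented in 00README.txt.

```python
from typing import Dict, Iterable, Iterator, List

# Column names follow the order given in the description above.
CONLLU_COLUMNS = ["id", "form", "lemma", "upos", "xpos",
                  "feats", "head", "deprel", "deps", "misc"]


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC value such as 'SpaceAfter=No|Key=Value' into a dict."""
    if not misc or misc == "_":
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_sentences(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield one list of token dicts per sentence."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:                 # a blank line closes the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):     # sentence-level comments are skipped
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLU_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}")
        token = dict(zip(CONLLU_COLUMNS, fields))
        token["misc"] = parse_misc(token["misc"])
        sentence.append(token)
    if sentence:                     # file may end without a blank line
        yield sentence


# Invented sample; 'SenseID' and its values are hypothetical key names.
SAMPLE = (
    "# text = Banks close.\n"
    "1\tBanks\tbank\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\tSenseID=hypothetical-id-1\n"
    "2\tclose\tclose\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No|SenseID=hypothetical-id-2\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(parse_sentences(SAMPLE.splitlines()))
```

To read a real subcorpus, the same generator can be fed an open file handle, e.g. `parse_sentences(open("elexis-wsd-en_corpus.conllu", encoding="utf-8"))`, after checking the MISC key names against 00README.txt.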
dc.language.iso slv
dc.language.iso eng
dc.language.iso bul
dc.language.iso dan
dc.language.iso por
dc.language.iso ita
dc.language.iso spa
dc.language.iso hun
dc.language.iso est
dc.language.iso nld
dc.language.iso srp
dc.language.iso hrv
dc.language.iso mkd
dc.language.iso ell
dc.language.iso ukr
dc.language.iso pol
dc.language.iso kat
dc.language.iso ron
dc.publisher Jožef Stefan Institute
dc.relation info:eu-repo/grantAgreement/EC/H2020/731015
dc.relation.isreferencedby https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_22_pp377-395.pdf
dc.relation.replaces http://hdl.handle.net/11356/2029
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://elex.is/
dc.subject word sense disambiguation
dc.subject parallel corpus
dc.subject sense annotation
dc.subject multilingual
dc.title Parallel sense-annotated corpus ELEXIS-WSD 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@ijs.si Jožef Stefan Institute
contact.person Jaka Čibej jaka.cibej@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor European Union EC/H2020/731015 ELEXIS - European Lexicographic Infrastructure euFunds info:eu-repo/grantAgreement/EC/H2020/731015
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor COST CA21167 Universality, Diversity and Idiosyncrasy in Language Technology (UniDive) Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 36432 sentences
size.info 28 files
size.info 572586 tokens
files.count 1
files.size 14760949


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name elexis-wsd-2.0.zip
Size 14.08 MB
Format application/zip
Description ELEXIS-WSD 2.0 (CoNLL-U)
MD5 384324e3b5f66090d8056b1a5c6d858a
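
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the local file path in the usage note are illustrative assumptions, not part of the repository.

```python
import hashlib

# Checksum as listed in this record for elexis-wsd-2.0.zip.
EXPECTED_MD5 = "384324e3b5f66090d8056b1a5c6d858a"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("elexis-wsd-2.0.zip")` should return `True` for an uncorrupted download saved under that (assumed) local file name.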
 File Preview  
  • elexis-wsd-2.0
    • potential_errors
      • elexis-wsd-da_corpus_potential-errors.txt 369 kB
      • elexis-wsd-bg_corpus_potential-errors.txt 675 kB
      • elexis-wsd-nl_corpus_potential-errors.txt 136 kB
      • elexis-wsd-en_corpus_potential-errors.txt 160 kB
      • elexis-wsd-it_corpus_potential-errors.txt 198 kB
      • elexis-wsd-et_corpus_potential-errors.txt 441 kB
      • elexis-wsd-hu_corpus_potential-errors.txt 224 kB
      • elexis-wsd-es_corpus_potential-errors.txt 391 kB
      • elexis-wsd-sr_corpus_potential-errors.txt 241 kB
      • elexis-wsd-pt_corpus_potential-errors.txt 449 kB
    • translations
      • elexis-wsd-uk_translations.tsv 538 kB
    • corpora
      • elexis-wsd-en_corpus.conllu 3 MB
      • elexis-wsd-pl_corpus.conllu 1 MB
      • elexis-wsd-ro_corpus.conllu 2 MB
      • elexis-wsd-bg_corpus.conllu 3 MB
      • elexis-wsd-et_corpus.conllu 2 MB
      • elexis-wsd-el_corpus.conllu 1 MB
      • elexis-wsd-sl_corpus.conllu 2 MB
      • elexis-wsd-sr_corpus.conllu 2 MB
      • elexis-wsd-da_corpus.conllu 3 MB
      • elexis-wsd-it_corpus.conllu 3 MB
      • elexis-wsd-mk_corpus.conllu 1 MB
      • elexis-wsd-hr_corpus.conllu 1 MB
      • elexis-wsd-ka_corpus.conllu 3 MB
      • elexis-wsd-es_corpus.conllu 3 MB
      • elexis-wsd-hu_corpus.conllu 3 MB
      • elexis-wsd-pt_corpus.conllu 3 MB
      • elexis-wsd-nl_corpus.conllu 3 MB
    • 00README.txt 19 kB
    • sense_inventories
      • elexis-wsd-sl_sense_inventory.tsv 666 kB
      • elexis-wsd-nl_sense-inventory.tsv 1 MB
      • elexis-wsd-en_sense-inventory.tsv 1 MB
      • elexis-wsd-sr_sense_inventory.tsv 1 MB
      • elexis-wsd-bg_sense-inventory.tsv 1 MB
      • elexis-wsd-it_sense-inventory.tsv 1 MB
      • elexis-wsd-et_sense-inventory.tsv 1 MB
      • elexis-wsd-hu_sense-inventory.tsv 4 MB
      • elexis-wsd-pt_sense-inventory.tsv 2 MB
      • elexis-wsd-es_sense-inventory.tsv 1 MB
      • elexis-wsd-da_sense-inventory.tsv 1008 kB
