Parallel sense-annotated corpus ELEXIS-WSD 2.0

Name: Parallel sense-annotated corpus ELEXIS-WSD 2.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Čibej, Jaka; Krek, Simon; Tiberius, Carole; Martelli, Federico; Navigli, Roberto; Kallas, Jelena; Gantar, Polona; Koeva, Svetla; Nimb, Sanni; Sandford Pedersen, Bolette; Olsen, Sussi; Langemets, Margit; Koppel, Kristina; Üksik, Tiiu; Dobrovoljc, Kaja; Ureña-Ruiz, Rafael; Sancho-Sánchez, José-Luis; Lipp, Veronika; Váradi, Tamás; Győrffy, András; Simon, László; Quochi, Valeria; Monachini, Monica; Frontini, Francesca; Tempelaars, Rob; Costa, Rute; Salgado, Ana; Munda, Tina; Kosem, Iztok; Roblek, Rebeka; Kamenšek, Urška; Zaranšek, Petra; Zgaga, Karolina; Ponikvar, Primož; Terčon, Luka; Jensen, Jonas; Flörke, Ida; Lorentzen, Henrik; Troelsgård, Thomas; Blagoeva, Diana; Hristov, Dimitar; Kolkovska, Sia; Muischnek, Kadri; Saul, Kertu; Jõgi, Karoliina; Bon, Mija; Stanković, Ranka; Krstev, Cvetana; Marković, Aleksandra; Ikonić Nešić, Milica; Giouli, Voula; Papanikolaou, Eri; Lobzhanidze, Irina; Barbu Mititelu, Verginica; Popa, Simina; Cristiana, Lea; Catalin, Mihaila; Irimia, Elena; Ostroški Anić, Ana; Runjaić, Siniša; Sviben, Robert; Pavić, Martina; Filipović Petrović, Ivana; Alberski, Bartłomiej; Cvetkoski, Vladimir; Kanishcheva, Olha; Makhachashvili, Rusudan

dc.contributor.author	Čibej, Jaka
dc.contributor.author	Krek, Simon
dc.contributor.author	Tiberius, Carole
dc.contributor.author	Martelli, Federico
dc.contributor.author	Navigli, Roberto
dc.contributor.author	Kallas, Jelena
dc.contributor.author	Gantar, Polona
dc.contributor.author	Koeva, Svetla
dc.contributor.author	Nimb, Sanni
dc.contributor.author	Sandford Pedersen, Bolette
dc.contributor.author	Olsen, Sussi
dc.contributor.author	Langemets, Margit
dc.contributor.author	Koppel, Kristina
dc.contributor.author	Üksik, Tiiu
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Ureña-Ruiz, Rafael
dc.contributor.author	Sancho-Sánchez, José-Luis
dc.contributor.author	Lipp, Veronika
dc.contributor.author	Váradi, Tamás
dc.contributor.author	Győrffy, András
dc.contributor.author	Simon, László
dc.contributor.author	Quochi, Valeria
dc.contributor.author	Monachini, Monica
dc.contributor.author	Frontini, Francesca
dc.contributor.author	Tempelaars, Rob
dc.contributor.author	Costa, Rute
dc.contributor.author	Salgado, Ana
dc.contributor.author	Munda, Tina
dc.contributor.author	Kosem, Iztok
dc.contributor.author	Roblek, Rebeka
dc.contributor.author	Kamenšek, Urška
dc.contributor.author	Zaranšek, Petra
dc.contributor.author	Zgaga, Karolina
dc.contributor.author	Ponikvar, Primož
dc.contributor.author	Terčon, Luka
dc.contributor.author	Jensen, Jonas
dc.contributor.author	Flörke, Ida
dc.contributor.author	Lorentzen, Henrik
dc.contributor.author	Troelsgård, Thomas
dc.contributor.author	Blagoeva, Diana
dc.contributor.author	Hristov, Dimitar
dc.contributor.author	Kolkovska, Sia
dc.contributor.author	Muischnek, Kadri
dc.contributor.author	Saul, Kertu
dc.contributor.author	Jõgi, Karoliina
dc.contributor.author	Bon, Mija
dc.contributor.author	Stanković, Ranka
dc.contributor.author	Krstev, Cvetana
dc.contributor.author	Marković, Aleksandra
dc.contributor.author	Ikonić Nešić, Milica
dc.contributor.author	Giouli, Voula
dc.contributor.author	Papanikolaou, Eri
dc.contributor.author	Lobzhanidze, Irina
dc.contributor.author	Barbu Mititelu, Verginica
dc.contributor.author	Popa, Simina
dc.contributor.author	Cristiana, Lea
dc.contributor.author	Catalin, Mihaila
dc.contributor.author	Irimia, Elena
dc.contributor.author	Ostroški Anić, Ana
dc.contributor.author	Runjaić, Siniša
dc.contributor.author	Sviben, Robert
dc.contributor.author	Pavić, Martina
dc.contributor.author	Filipović Petrović, Ivana
dc.contributor.author	Alberski, Bartłomiej
dc.contributor.author	Cvetkoski, Vladimir
dc.contributor.author	Kanishcheva, Olha
dc.contributor.author	Makhachashvili, Rusudan
dc.date.accessioned	2026-05-14T15:32:07Z
dc.date.available	2026-05-14T15:32:07Z
dc.date.issued	2026-04-01
dc.identifier.uri	http://hdl.handle.net/11356/2101
dc.description	ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 2.0 contains subcorpora with sentences for 17 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, Slovene, Serbian, Croatian, Macedonian, Greek, Romanian, Georgian, and Polish. In addition, it contains manually corrected translations for Ukrainian; these will be processed in future versions. In 2.0, not all corpora cover all annotation layers; a more detailed overview is available in 00README.txt. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe (https://lindat.mff.cuni.cz/services/udpipe/ - see 00README.txt for information on specific models). 
Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2 and manually validated for Slovene, Georgian, Romanian, and Estonian.

List of sense inventories:
- BG: Dictionary of Bulgarian
- DA: DanNet – The Danish WordNet
- EN: Open English WordNet
- ES: Spanish Wiktionary
- ET: The EKI Combined Dictionary of Estonian
- HU: The Explanatory Dictionary of the Hungarian Language
- IT: PSC + Italian WordNet
- NL: Open Dutch WordNet
- PT: Portuguese Academy Dictionary (DACL)
- SL: Digital Dictionary Database of Slovene
- SR: Serbian WordNet

The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en). Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus.
Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.

Updates in version 2.0:
- Subcorpora for 7 new languages (Serbian, Croatian, Macedonian, Greek, Romanian, Polish, Georgian) and translations for Ukrainian were added.
- Sense annotations for ELEXIS-WSD-sl were updated. Additional multiword expression annotations were added according to the PARSEME 2.0 guidelines (see 00README.txt).
dc.language.iso	slv
dc.language.iso	eng
dc.language.iso	bul
dc.language.iso	dan
dc.language.iso	por
dc.language.iso	ita
dc.language.iso	spa
dc.language.iso	hun
dc.language.iso	est
dc.language.iso	nld
dc.language.iso	srp
dc.language.iso	hrv
dc.language.iso	mkd
dc.language.iso	ell
dc.language.iso	ukr
dc.language.iso	pol
dc.language.iso	kat
dc.language.iso	ron
dc.publisher	Jožef Stefan Institute
dc.relation	info:eu-repo/grantAgreement/EC/H2020/731015
dc.relation.isreferencedby	https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_22_pp377-395.pdf
dc.relation.replaces	http://hdl.handle.net/11356/2029
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://elex.is/
dc.subject	word sense disambiguation
dc.subject	parallel corpus
dc.subject	sense annotation
dc.subject	multilingual
dc.title	Parallel sense-annotated corpus ELEXIS-WSD 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Jaka Čibej jaka.cibej@ijs.si Jožef Stefan Institute
contact.person	Jaka Čibej jaka.cibej@ff.uni-lj.si Faculty of Arts, University of Ljubljana
sponsor	European Union EC/H2020/731015 ELEXIS - European Lexicographic Infrastructure euFunds info:eu-repo/grantAgreement/EC/H2020/731015
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	COST CA21167 Universality, Diversity and Idiosyncrasy in Language Technology (UniDive) Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	36432 sentences
size.info	28 files
size.info	572586 tokens
files.count	1
files.size	14760949

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: elexis-wsd-2.0.zip
Size: 14.08 MB
Format: application/zip
Description: ELEXIS-WSD 2.0 (CoNLL-U)
MD5: 384324e3b5f66090d8056b1a5c6d858a
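
The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, not part of the record itself; the local path `elexis-wsd-2.0.zip` and the helper names `md5_of_file` and `matches_record` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksum copied from the MD5 field of this record.
EXPECTED_MD5 = "384324e3b5f66090d8056b1a5c6d858a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("elexis-wsd-2.0.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
```

MD5 is suitable here only as an integrity check against accidental corruption during download, not as a security guarantee.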

File preview

elexis-wsd-2.0
- potential_errors
  - elexis-wsd-da_corpus_potential-errors.txt (369 kB)
  - elexis-wsd-bg_corpus_potential-errors.txt (675 kB)
  - elexis-wsd-nl_corpus_potential-errors.txt (136 kB)
  - elexis-wsd-en_corpus_potential-errors.txt (160 kB)
  - elexis-wsd-it_corpus_potential-errors.txt (198 kB)
  - elexis-wsd-et_corpus_potential-errors.txt (441 kB)
  - elexis-wsd-hu_corpus_potential-errors.txt (224 kB)
  - elexis-wsd-es_corpus_potential-errors.txt (391 kB)
  - elexis-wsd-sr_corpus_potential-errors.txt (241 kB)
  - elexis-wsd-pt_corpus_potential-errors.txt (449 kB)
- translations
  - elexis-wsd-uk_translations.tsv (538 kB)
- corpora
  - elexis-wsd-en_corpus.conllu (3 MB)
  - elexis-wsd-pl_corpus.conllu (1 MB)
  - elexis-wsd-ro_corpus.conllu (2 MB)
  - elexis-wsd-bg_corpus.conllu (3 MB)
  - elexis-wsd-et_corpus.conllu (2 MB)
  - elexis-wsd-el_corpus.conllu (1 MB)
  - elexis-wsd-sl_corpus.conllu (2 MB)
  - elexis-wsd-sr_corpus.conllu (2 MB)
  - elexis-wsd-da_corpus.conllu (3 MB)
  - elexis-wsd-it_corpus.conllu (3 MB)
  - elexis-wsd-mk_corpus.conllu (1 MB)
  - elexis-wsd-hr_corpus.conllu (1 MB)
  - elexis-wsd-ka_corpus.conllu (3 MB)
  - elexis-wsd-es_corpus.conllu (3 MB)
  - elexis-wsd-hu_corpus.conllu (3 MB)
  - elexis-wsd-pt_corpus.conllu (3 MB)
  - elexis-wsd-nl_corpus.conllu (3 MB)
- 00README.txt (19 kB)
- sense_inventories
  - elexis-wsd-sl_sense_inventory.tsv (666 kB)
  - elexis-wsd-nl_sense-inventory.tsv (1 MB)
  - elexis-wsd-en_sense-inventory.tsv (1 MB)
  - elexis-wsd-sr_sense_inventory.tsv (1 MB)
  - elexis-wsd-bg_sense-inventory.tsv (1 MB)
  - elexis-wsd-it_sense-inventory.tsv (1 MB)
  - elexis-wsd-et_sense-inventory.tsv (1 MB)
  - elexis-wsd-hu_sense-inventory.tsv (4 MB)
  - elexis-wsd-pt_sense-inventory.tsv (2 MB)
  - elexis-wsd-es_sense-inventory.tsv (1 MB)
  - elexis-wsd-da_sense-inventory.tsv (1008 kB)
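
The files under corpora use the ten-column CoNLL-U layout given in dc.description. The following is a minimal sketch of reading such a file with the Python standard library only. It assumes that the MISC column follows the usual CoNLL-U convention of `|`-separated `Key=Value` pairs; the actual key names for sense IDs, multiword expressions and named entities are documented in 00README.txt and are not reproduced here. The function names and the `ExampleKey=demo` value in the sample are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Column names in CoNLL-U order, as listed in the corpus description.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC value such as 'SpaceAfter=No|Key=Value' into a dict.

    Assumes '|'-separated Key=Value pairs; '_' marks an empty column.
    """
    if misc in ("", "_"):
        return {}
    pairs = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, object]]]:
    """Yield one list of token dicts per sentence (sentences end at blank lines)."""
    sentence: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment lines
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip lines that do not have exactly ten columns
        token: Dict[str, object] = dict(zip(COLUMNS, fields))
        token["misc"] = parse_misc(fields[-1])
        sentence.append(token)
    if sentence:
        yield sentence


if __name__ == "__main__":
    # Invented sample; "ExampleKey=demo" is a placeholder, not a real corpus key.
    sample = [
        "# text = Banks open.\n",
        "1\tBanks\tbank\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\tExampleKey=demo\n",
        "2\topen\topen\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n",
        "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n",
        "\n",
    ]
    for sent in read_sentences(sample):
        for tok in sent:
            print(tok["form"], tok["upos"], tok["misc"])
```

To read a real file, an open file handle can be passed in place of the sample list, for example `read_sentences(open("corpora/elexis-wsd-en_corpus.conllu", encoding="utf-8"))`, where the path is an assumption about where the archive was unpacked.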
