Dataset of annotated headword-synonym-distractor triplets SYNDIST

Name: Dataset of annotated headword-synonym-distractor triplets SYNDIST
License: https://creativecommons.org/licenses/by/4.0/

Kosem, Iztok; Arhar Holdt, Špela; Zgaga, Karolina; Šešet, Jure; Kamenšek, Urška; Zaranšek, Petra; Ponikvar, Primož; Arčon, Tjaša

dc.contributor.author	Kosem, Iztok
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Zgaga, Karolina
dc.contributor.author	Šešet, Jure
dc.contributor.author	Kamenšek, Urška
dc.contributor.author	Zaranšek, Petra
dc.contributor.author	Ponikvar, Primož
dc.contributor.author	Arčon, Tjaša
dc.date.accessioned	2025-11-11T16:02:30Z
dc.date.available	2025-11-11T16:02:30Z
dc.date.issued	2025-11-10
dc.identifier.uri	http://hdl.handle.net/11356/2056
dc.description	The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer or alternative to the synonym that can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five. The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}." The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each evaluated their own part, and the second lexicographer subsequently also inspected the evaluations of the first.
An estimated 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as being too similar to the synonym or being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate new synonym of the headword). The dataset also includes information on the frequency of the synonyms and distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (not for multiword items or non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://elex.link/elex2025/wp-content/uploads/eLex2025-37-KosemArhar-Holdt.pdf
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/llm4dh/en/
dc.subject	synonyms
dc.subject	distractors
dc.subject	large language models
dc.subject	manual annotation
dc.title	Dataset of annotated headword-synonym-distractor triplets SYNDIST
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Iztok Kosem iztok.kosem@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
sponsor	University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor	Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2024-2025 Data completion and gamification of dictionary resources at CJVT UL (PODVIG) nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	European Union HORIZON-WIDERA-2023-TALENTS-01-01 101186647 EU Era Chair (AI4DH) euFunds
size.info	51023 entries
files.count	1
files.size	849766
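
The description states that similarity between headwords, synonyms, and distractors is calculated using Gestalt pattern matching. The following is a minimal sketch of how such a score can be computed, assuming Python's standard `difflib.SequenceMatcher`, whose algorithm is based on Ratcliff/Obershelp gestalt pattern matching. The record does not say which implementation, normalization, or casing was used for the dataset, so the function name `gestalt_similarity` and the choice of library are illustrative assumptions, and the scores may differ from those stored in the dataset.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings.

    Assumption: plain character-level comparison with no lowercasing or
    other preprocessing; the dataset's actual settings are not documented
    in this record.
    """
    # The first argument (isjunk) is None, so no characters are ignored.
    return SequenceMatcher(None, a, b).ratio()


# Example triplet taken from the prompt quoted in the description.
headword, synonym, distractor = "živahen", "vesel", "resen"

# The dataset reports similarity for these two pairs.
print("headword vs synonym:  ", gestalt_similarity(headword, synonym))
print("synonym vs distractor:", gestalt_similarity(synonym, distractor))
```

The ratio is 1.0 for identical strings and 0.0 for strings with no characters in common. Note that `SequenceMatcher.ratio()` can give slightly different values depending on argument order, so a fixed order should be used when comparing pairs.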