
 
dc.contributor.author Kosem, Iztok
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Zgaga, Karolina
dc.contributor.author Šešet, Jure
dc.contributor.author Kamenšek, Urška
dc.contributor.author Zaranšek, Petra
dc.contributor.author Ponikvar, Primož
dc.contributor.author Arčon, Tjaša
dc.date.accessioned 2025-11-11T16:02:30Z
dc.date.available 2025-11-11T16:02:30Z
dc.date.issued 2025-11-10
dc.identifier.uri http://hdl.handle.net/11356/2056
dc.description The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer or alternative to a synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five. The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}." All distractors, except those identified as existing synonyms in the Thesaurus, were manually evaluated by two lexicographers. Each of them evaluated their own part, and the second lexicographer subsequently also inspected the evaluations of the first.
An estimated 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as being too similar to the synonym, or the word being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate new synonym of the headword). The dataset also includes information on the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). Frequency information is provided for single-word lemmas only (not for multiword items or for non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, information on the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.
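The similarity measure named in the description can be illustrated with a short sketch. The record does not say which implementation was used to compute the scores in the dataset; the example below assumes Python's standard-library difflib.SequenceMatcher, whose ratio() is based on Gestalt (Ratcliff/Obershelp) pattern matching. The function name gestalt_similarity is hypothetical, and the example words are taken from the format line of the prompt, not from the dataset itself.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two strings.

    Assumption: difflib.SequenceMatcher stands in for the (unspecified)
    Gestalt pattern matching implementation used for the dataset.
    The ratio is 2 * M / T, where M is the number of matching characters
    and T is the combined length of both strings.
    """
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Example triplet from the prompt's format line: headword - synonym - distractor.
    headword, synonym, distractor = "živahen", "vesel", "resen"
    print("headword vs synonym:  ", gestalt_similarity(headword, synonym))
    print("synonym vs distractor:", gestalt_similarity(synonym, distractor))
```

A high synonym-distractor score paired with a low headword-synonym score is the pattern the prompt asks for: a distractor that looks like the synonym but is not one. Scores computed this way may differ from those in the dataset if a different implementation or preprocessing (for example, case folding) was used.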
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/en/
dc.subject synonyms
dc.subject distractors
dc.subject large language models
dc.subject manual annotation
dc.title Dataset of annotated headword-synonym-distractor triplets SYNDIST
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType other
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2024-2025 Data completion and gamification of dictionary resources at CJVT UL (PODVIG) nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 51023 entries
files.count 1
files.size 849766


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name SYNDIST.zip
Size 829.85 KB
Format application/zip
Description dataset + Readme file
MD5 5c7069d6668158b399b7fd29bdd56f16
 File Preview
    • Readme.txt
    • Synonyms-distractors-SYNDIST.tsv
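
The MD5 checksum and byte size listed in this record can be used to check a downloaded copy of the archive. The following is a minimal sketch using only Python's standard library; the function names md5_of_file and verify_download and the local file path are hypothetical, while the expected checksum and size are copied from the MD5 and files.size fields of this record.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 and files.size fields).
EXPECTED_MD5 = "5c7069d6668158b399b7fd29bdd56f16"
EXPECTED_SIZE = 849766  # bytes, shown as 829.85 KB in the file listing


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path) -> bool:
    """Return True if the file matches the size and checksum in the record."""
    file_path = Path(path)
    if file_path.stat().st_size != EXPECTED_SIZE:
        return False
    return md5_of_file(file_path) == EXPECTED_MD5
```

For example, verify_download("SYNDIST.zip") would return True for an intact copy saved under that (assumed) local name. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.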
