| dc.contributor.author | Kosem, Iztok |
| dc.contributor.author | Arhar Holdt, Špela |
| dc.contributor.author | Zgaga, Karolina |
| dc.contributor.author | Šešet, Jure |
| dc.contributor.author | Kamenšek, Urška |
| dc.contributor.author | Zaranšek, Petra |
| dc.contributor.author | Ponikvar, Primož |
| dc.contributor.author | Arčon, Tjaša |
| dc.date.accessioned | 2025-11-11T16:02:30Z |
| dc.date.available | 2025-11-11T16:02:30Z |
| dc.date.issued | 2025-11-10 |
| dc.identifier.uri | http://hdl.handle.net/11356/2056 |
| dc.description | The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect answer/alternative to the synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The criteria for selecting the headwords (nouns, adjectives, verbs, and adverbs) were that they had to be frequent and had to have several synonyms, preferably more than five. The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt: "You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning. The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form). The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.). The distractor must be a frequent word in the Slovene language. The distractor must look similar to the synonym but have a different meaning. Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym} The distractor cannot be one of these words: {synonym_set}." The manual evaluation of all the distractors (with the exception of the distractors that were identified as existing synonyms in the Thesaurus) was conducted by two lexicographers. Each of them evaluated their own part, and the second one subsequently also inspected the evaluations of the first one. An estimated 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide due to certain characteristics, such as the distractor being too similar to the synonym or the word being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate new synonym of the headword). The dataset also includes the frequency of the synonyms and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). The frequency information is provided for single-word lemmas only (not for multiword items or for non-lemma single-word forms such as plural forms of nouns or comparatives of adjectives). In addition, the similarity between the headwords and synonyms, and between the synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching. |
| dc.language.iso | slv |
| dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
| dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://www.cjvt.si/llm4dh/en/ |
| dc.subject | synonyms |
| dc.subject | distractors |
| dc.subject | large language models |
| dc.subject | manual annotation |
| dc.title | Dataset of annotated headword-synonym-distractor triplets SYNDIST |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | other |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Iztok Kosem iztok.kosem@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
| sponsor | Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds |
| sponsor | University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
| sponsor | Ministry of Culture of the Republic of Slovenia JR-infrastruktura-SJ-2024-2025 Data completion and gamification of dictionary resources at CJVT UL (PODVIG) nationalFunds |
| sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| size.info | 51023 entries |
| files.count | 1 |
| files.size | 849766 |
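The description states that similarity is calculated using Gestalt pattern matching but does not name the implementation. As a minimal sketch, assuming Python's standard-library `difflib.SequenceMatcher` (which the Python documentation describes as a Ratcliff/Obershelp-style "gestalt pattern matching" algorithm), a comparable score could be computed as follows. The function name `gestalt_similarity` is hypothetical, and the scores stored in the dataset may have been produced with different tooling or settings.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings.

    Assumption: difflib's matcher stands in for the "Gestalt pattern
    matching" named in the description; the dataset authors' exact
    implementation is not documented in this record.
    """
    # autojunk only affects long sequences; it is disabled here so that
    # results do not depend on that heuristic.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Triplet taken from the format example in the generation prompt:
# headword - synonym - distractor.
headword, synonym, distractor = "živahen", "vesel", "resen"

headword_synonym = gestalt_similarity(headword, synonym)
synonym_distractor = gestalt_similarity(synonym, distractor)

print(f"{headword} / {synonym}: {headword_synonym:.3f}")
print(f"{synonym} / {distractor}: {synonym_distractor:.3f}")
```

The score is twice the number of matched characters divided by the combined length of both strings, so identical strings score 1.0 and strings with no characters in common score 0.0.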
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: SYNDIST.zip
- Size: 829.85 KB
- Format: application/zip
- Description: dataset + Readme file
- MD5: 5c7069d6668158b399b7fd29bdd56f16
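The MD5 value listed for the archive can be used to check that a download is intact. The following is a minimal sketch using Python's `hashlib`; the local path `SYNDIST.zip` and the helper names `md5_of_file` and `matches_expected` are assumptions made for illustration, not part of the record.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing of this record.
EXPECTED_MD5 = "5c7069d6668158b399b7fd29bdd56f16"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("SYNDIST.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum mismatch")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.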