<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href='static/style.xsl' type='text/xsl'?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-03T23:25:15Z</responseDate><request verb="GetRecord" identifier="oai:www.clarin.si:11356/2056" metadataPrefix="oai_dc">http://www.clarin.si/repository/oai/request</request><GetRecord><record><header><identifier>oai:www.clarin.si:11356/2056</identifier><datestamp>2025-12-23T17:04:46Z</datestamp><setSpec>hdl_11356_1023</setSpec><setSpec>hdl_11356_1024</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Dataset of annotated headword-synonym-distractor triplets SYNDIST</dc:title>
<dc:creator>Kosem, Iztok</dc:creator>
<dc:creator>Arhar Holdt, Špela</dc:creator>
<dc:creator>Zgaga, Karolina</dc:creator>
<dc:creator>Šešet, Jure</dc:creator>
<dc:creator>Kamenšek, Urška</dc:creator>
<dc:creator>Zaranšek, Petra</dc:creator>
<dc:creator>Ponikvar, Primož</dc:creator>
<dc:creator>Arčon, Tjaša</dc:creator>
<dc:subject>synonyms</dc:subject>
<dc:subject>distractors</dc:subject>
<dc:subject>large language models</dc:subject>
<dc:subject>manual annotation</dc:subject>
<dc:description>The dataset contains 51,023 headword-synonym-distractor triplets for 5,000 headwords. A distractor is defined as an incorrect alternative to a synonym, which can be similar to the synonym in meaning and/or form. Headwords and their synonyms were obtained from the Thesaurus of Modern Slovene (http://hdl.handle.net/11356/1916), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). The headwords (nouns, adjectives, verbs, and adverbs) were selected on two criteria: they had to be frequent, and they had to have several synonyms, preferably more than five.&#xd;
&#xd;
The distractors were obtained with the Gemini-2.0-flash (https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash) model, using the following prompt:&#xd;
"You are given headword and a synonym. Create a distractor — a word that looks similar to the synonym but has a different meaning.&#xd;
The distractor must be the same part of speech as the synonym (e.g., if the synonyms are verbs in their base form, the distractor must also be a verb in its base form).&#xd;
The distractor must not include sensitive vocabulary (e.g., words related to minorities, religion, sexual content, violence, etc.).&#xd;
The distractor must be a frequent word in the Slovene language.&#xd;
The distractor must look similar to the synonym but have a different meaning.&#xd;
Write the distractor in the same line as the headword and synonym, following this format: živahen - vesel - resen. These are the headword and synonym: {word} - {synonym}&#xd;
The distractor cannot be one of these words: {synonym_set}."&#xd;
&#xd;
All distractors, except those identified as existing synonyms in the Thesaurus, were manually evaluated by two lexicographers. Each evaluated their own part of the data, and the second lexicographer subsequently also inspected the evaluations of the first. An estimated 30-35% of the data was evaluated by both lexicographers. Five decisions were used: good distractor, bad distractor, problematic (i.e. difficult to decide because of certain characteristics, such as being too similar to the synonym or being too archaic or informal), same as synonym, and synonym candidate (likely a legitimate new synonym of the headword).&#xd;
&#xd;
The dataset also includes the frequency of the synonyms and distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). Frequency information is provided for single-word lemmas only (not for multiword items or non-lemma single-word forms, such as plural forms of nouns or comparatives of adjectives). In addition, the similarity between headwords and synonyms, and between synonyms and distractors, is provided. Similarity is calculated using Gestalt pattern matching.</dc:description>
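<!--
Editorial sketch of the similarity measure named in the description above. The record does not say which implementation was used, so this is an assumption: Python's standard difflib.SequenceMatcher, whose ratio() follows the Ratcliff/Obershelp (Gestalt) approach and returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The function name gestalt_similarity is hypothetical, and any preprocessing (for example lowercasing) applied in the dataset is unknown.

```python
from difflib import SequenceMatcher


def gestalt_similarity(a, b):
    """Return the Gestalt (Ratcliff/Obershelp) ratio 2*M/T for two strings."""
    # autojunk=False turns off difflib's heuristic for very long inputs,
    # so short words are always compared character by character.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Example triplet taken from the format line of the prompt:
# headword, synonym, distractor.
headword, synonym, distractor = "živahen", "vesel", "resen"

print(gestalt_similarity(headword, synonym))
print(gestalt_similarity(synonym, distractor))
```

For the synonym "vesel" and the distractor "resen", the longest common block is "ese" (3 characters out of 10 in total), so this sketch gives 2*3/10 = 0.6. Values stored in the dataset may differ if another implementation or normalisation was used.
-->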
<dc:date>2025-11-10</dc:date>
<dc:type>lexicalConceptualResource</dc:type>
<dc:identifier>http://hdl.handle.net/11356/2056</dc:identifier>
<dc:language>slv</dc:language>
<dc:relation>https://elex.link/elex2025/wp-content/uploads/eLex2025-37-KosemArhar-Holdt.pdf</dc:relation>
<dc:rights>Creative Commons - Attribution 4.0 International (CC BY 4.0)</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by/4.0/</dc:rights>
<dc:rights>PUB</dc:rights>
<dc:format>application/zip</dc:format>
<dc:format>text/plain; charset=utf-8</dc:format>
<dc:format>downloadable_files_count: 1</dc:format>
<dc:publisher>Faculty of Computer and Information Science, University of Ljubljana</dc:publisher>
<dc:publisher>Centre for Language Resources and Technologies, University of Ljubljana</dc:publisher>
<dc:source>https://www.cjvt.si/llm4dh/en/</dc:source>
</oai_dc:dc>
</metadata></record></GetRecord></OAI-PMH>