Dataset of annotated collocation-distractor pairs COLLDIST

Name: Dataset of annotated collocation-distractor pairs COLLDIST
License: https://creativecommons.org/licenses/by/4.0/

Kosem, Iztok; Arhar Holdt, Špela; Zgaga, Karolina; Arčon, Tjaša


dc.contributor.author	Kosem, Iztok
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Zgaga, Karolina
dc.contributor.author	Arčon, Tjaša
dc.date.accessioned	2025-12-23T17:06:19Z
dc.date.available	2025-12-23T17:06:19Z
dc.date.issued	2025-12-23
dc.identifier.uri	http://hdl.handle.net/11356/2076
dc.description	The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect alternative to a collocation that may resemble the collocation in meaning and/or form. Headwords and their collocations were obtained from the Collocations Dictionary of Modern Slovene (http://hdl.handle.net/11356/1933), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). Collocations were selected only if they had been manually validated and assigned to one of the senses of the headword. The distractors were generated with the gpt-4o-2024-08-06 model (https://platform.openai.com/docs/models/gpt-4o), using the following prompt: “We are preparing a language game where the player will be given a headword, a collocation (combination of the headword and another word) and a distractor (a collocation that has the same headword, but the other word is not a collocate of the main word). For example "huge victory" is a collocation of "victory", but "rotten victory" is not so "rotten is a good distractor. The rules for forming distractors are the following: 1. Distractor has to be a single word. 2. Distractor has to have the same part of speech as the word being replaced (e.g. if the word next to the headword is a noun, the distractor should also be a noun). 3. Distractor should not include sensitive vocabulary, e.g. related to minorities, nationalities, religion, sexual content and similar. 4. Distractor has to be a word that is frequent in the Slovene language. 5. Distractor has to be a word that is completely unlikely to occur with the headword. Return the distractor in the same format as the examples below: Example: hiter - hitre rešitve (hiter + rešitve) - hitre težave (hiter + težava) Example: obljuba - držati obljubo (držati + obljuba) - najti obljubo (najti + obljuba). This is the headword: {headword}. 
This is the collocation: {collocation} ({all_collocation_parts}). The distractor has to be a collocation that contains the headword {headword} but is unlikely to occur with it. Only return the distractor in the correct format with the given headword {headword}. No explanations, no other text.” The evaluation was conducted in two parts, automatic and manual. Six labels were used: good, good-possible_collocation, bad-wrong_headword, bad-same_as_collocation, bad-grammar_problem, bad-collocation. The automatic evaluation used frequency information from the data warehouse containing all the collocations from the Gigafida 2.0 corpus. If the distractor was found there, it was considered bad (bad-collocation); otherwise it was considered good. We also identified the distractors that were identical to the collocation provided (bad-same_as_collocation) and the distractors in which the headword was not kept (bad-wrong_headword). The remaining distractors were manually evaluated, and those with grammatical problems (collocate and headword not agreeing in case, number, gender etc.) were labelled as bad (bad-grammar_problem). All the others were considered good (45,772 out of 59,598, or 77 %); however, we also annotated those that were deemed possible collocations, i.e. the combination sounded possible in the language, e.g. to party loudly (good-possible_collocation). The dataset also includes frequency and logDice statistics for the collocations and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320).
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby	https://elex.link/elex2025/wp-content/uploads/eLex2025-37-KosemArhar-Holdt.pdf
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/llm4dh/en/
dc.subject	collocations
dc.subject	distractors
dc.subject	large language models
dc.subject	manual annotation
dc.title	Dataset of annotated collocation-distractor pairs COLLDIST
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	other
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Iztok Kosem iztok.kosem@fri.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds
sponsor	University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	European Union HORIZON-WIDERA-2023-TALENTS-01-01 101186647 EU Era Chair (AI4DH) euFunds
size.info	59598 entries
files.count	1
files.size	1534093
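
The automatic part of the evaluation described in dc.description can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name auto_label, the representation of a collocation as a pair of lemmas, the set corpus_collocations standing in for the Gigafida 2.0 collocation data warehouse, and the order of the checks are all assumptions made here. Only the label names come from the record.

```python
from typing import Optional, Set, Tuple

# Hypothetical representation: a collocation as a pair of lemmas,
# e.g. ("hiter", "težava") for "hitre težave".
Parts = Tuple[str, str]


def auto_label(
    headword: str,
    collocation_parts: Parts,
    distractor_parts: Parts,
    corpus_collocations: Set[Parts],
) -> Optional[str]:
    """Return an automatic 'bad-*' label, or None if no automatic check fires.

    The order of the checks is an assumption; the record does not state it.
    """
    # The distractor must still contain the headword.
    if headword not in distractor_parts:
        return "bad-wrong_headword"
    # The distractor must differ from the original collocation.
    if distractor_parts == collocation_parts:
        return "bad-same_as_collocation"
    # A distractor attested in the corpus is a real collocation.
    if distractor_parts in corpus_collocations:
        return "bad-collocation"
    # Nothing fired: the pair is left for manual evaluation.
    return None
```

A pair for which the function returns None would then go on to the manual step, where the bad-grammar_problem, good and good-possible_collocation labels are assigned.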