| dc.contributor.author | Kosem, Iztok |
| dc.contributor.author | Arhar Holdt, Špela |
| dc.contributor.author | Zgaga, Karolina |
| dc.contributor.author | Arčon, Tjaša |
| dc.date.accessioned | 2025-12-23T17:06:19Z |
| dc.date.available | 2025-12-23T17:06:19Z |
| dc.date.issued | 2025-12-23 |
| dc.identifier.uri | http://hdl.handle.net/11356/2076 |
| dc.description | The dataset contains 59,598 collocation-distractor pairs for 2,856 headwords. A distractor is defined as an incorrect alternative to a collocation, which can be similar to the collocation in meaning and/or form. Headwords and their collocations were obtained from the Collocations Dictionary of Modern Slovene (http://hdl.handle.net/11356/1933), which is part of the Dictionary Database of Slovene (the database is available via API: https://wiki.cjvt.si/books/digital-dictionary-database-of-slovene). To be selected, a collocation had to be manually validated and assigned to one of the senses of the headword. The distractors were obtained with the gpt-4o-2024-08-06 (https://platform.openai.com/docs/models/gpt-4o) model, using the following prompt: “We are preparing a language game where the player will be given a headword, a collocation (combination of the headword and another word) and a distractor (a collocation that has the same headword, but the other word is not a collocate of the main word). For example "huge victory" is a collocation of "victory", but "rotten victory" is not so "rotten is a good distractor. The rules for forming distractors are the following: 1. Distractor has to be a single word. 2. Distractor has to have the same part of speech as the word being replaced (e.g. if the word next to the headword is a noun, the distractor should also be a noun). 3. Distractor should not include sensitive vocabulary, e.g. related to minorities, nationalities, religion, sexual content and similar. 4. Distractor has to be a word that is frequent in the Slovene language. 5. Distractor has to be a word that is completely unlikely to occur with the headword. Return the distractor in the same format as the examples below: Example: hiter - hitre rešitve (hiter + rešitve) - hitre težave (hiter + težava) Example: obljuba - držati obljubo (držati + obljuba) - najti obljubo (najti + obljuba). This is the headword: {headword}. This is the collocation: {collocation} ({all_collocation_parts}). The distractor has to be a collocation that contains the headword {headword} but is unlikely to occur with it. Only return the distractor in the correct format with the given headword {headword}. No explanations, no other text.” The evaluation was conducted in two parts, automatic and manual. Six decision labels were used: good, good-possible_collocation, bad-wrong_headword, bad-same_as_collocation, bad-grammar_problem, bad-collocation. The automatic evaluation used frequency information from the data warehouse containing all the collocations from the Gigafida 2.0 corpus. If the distractor was found there, it was considered bad (bad-collocation), otherwise good. We also identified the distractors that were identical to the collocations provided (bad-same_as_collocation), and the distractors in which the headword was not kept (bad-wrong_headword). The remaining distractors were manually evaluated, and those with grammatical problems (collocate and headword not agreeing in case, number, gender etc.) were labelled as bad (bad-grammar_problem). All the others were considered good (45,772 out of 59,598, or 77%); however, we also annotated those that were deemed possible collocations, i.e. combinations that sounded plausible in the language, e.g. to party loudly (good-possible_collocation). The dataset also includes the frequency and logDice statistics of the collocations and the distractors in the Gigafida 2.0 reference corpus of Slovene (http://hdl.handle.net/11356/1320). |
| dc.language.iso | slv |
| dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
| dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
| dc.relation.isreferencedby | https://elex.link/elex2025/wp-content/uploads/eLex2025-37-KosemArhar-Holdt.pdf |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://www.cjvt.si/llm4dh/en/ |
| dc.subject | collocations |
| dc.subject | distractors |
| dc.subject | large language models |
| dc.subject | manual annotation |
| dc.title | Dataset of annotated collocation-distractor pairs COLLDIST |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | other |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Iztok Kosem iztok.kosem@fri.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
| sponsor | Public Agency for Scientific Research and Innovation of the Republic of Slovenia GC-0002 Large Language Models for Digital Humanities (LLM4DH) nationalFunds |
| sponsor | University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| size.info | 59598 entries |
| files.count | 1 |
| files.size | 1534093 |
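
The automatic part of the evaluation described in dc.description can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the regular expression, the order of the checks, and the small in-memory set standing in for the Gigafida 2.0 collocation data warehouse are all assumptions. Only the label names and the `surface (part + part)` format come from the record.

```python
import re

# Hypothetical stand-in for the Gigafida 2.0 collocation data warehouse:
# a set of lemma pairs attested as collocations (illustrative values only).
KNOWN_COLLOCATIONS = {("hiter", "rešitev"), ("držati", "obljuba")}

# Matches the "surface form (part + part)" format shown in the prompt examples.
PATTERN = re.compile(r"^(?P<surface>.+?)\s*\((?P<a>[^+()]+)\+(?P<b>[^+()]+)\)$")


def parse_parts(text):
    """Split 'hitre težave (hiter + težava)' into surface form and part pair."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    return match["surface"].strip(), (match["a"].strip(), match["b"].strip())


def auto_label(headword, collocation, distractor, known=KNOWN_COLLOCATIONS):
    """Assign one of the automatic labels; the check order is an assumption."""
    _, collocation_parts = parse_parts(collocation)
    _, distractor_parts = parse_parts(distractor)
    if headword not in distractor_parts:
        return "bad-wrong_headword"
    if distractor_parts == collocation_parts:
        return "bad-same_as_collocation"
    if distractor_parts in known:
        return "bad-collocation"
    # Candidates labelled "good" here still go on to manual evaluation.
    return "good"


print(auto_label("hiter",
                 "hitre rešitve (hiter + rešitev)",
                 "hitre težave (hiter + težava)"))
```

In this sketch a result of `good` only means the candidate passed the automatic filters; according to the description, the labels bad-grammar_problem and good-possible_collocation were assigned in the later manual step.
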
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: COLLDIST.zip
- Size: 1.46 MB
- Format: application/zip
- Description: database + Readme
- MD5: b05fdc75fa1c7d88c700fcfa883ada92
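
After downloading the archive, the MD5 value listed above can be used to check that the file arrived intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the local path `COLLDIST.zip` are assumptions, while the expected digest is the one given in this record.

```python
import hashlib

# MD5 digest published in this record for COLLDIST.zip.
EXPECTED_MD5 = "b05fdc75fa1c7d88c700fcfa883ada92"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the published one."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    try:
        print(matches_record("COLLDIST.zip"))
    except FileNotFoundError:
        print("COLLDIST.zip not found in the current directory")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a guarantee against deliberate tampering.
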