dc.contributor.author | Škvorc, Tadej |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.date.accessioned | 2025-01-27T17:13:19Z |
dc.date.available | 2025-01-27T17:13:19Z |
dc.date.issued | 2025-01-24 |
dc.identifier.uri | http://hdl.handle.net/11356/2008 |
dc.description | SloDicWSD is a Slovene word-sense disambiguation (WSD) corpus generated from data contained in SSKJ (Slovar slovenskega knjižnega jezika, the largest dictionary of standard Slovene). The corpus is an automatically constructed WSD dataset that uses the SSKJ sense inventory and consists of SSKJ usage examples converted to complete sentences using GPT-3.5 Turbo (https://platform.openai.com/docs/models#gpt-3-5-turbo). We limited the corpus to the top 758 lemmas present in the Slovene part of the Elexis-WSD dataset (http://hdl.handle.net/11356/1842). For each lemma, we extracted every usage example from the SSKJ dictionary and labeled it with the matching sense. As these usage examples are likely too short to be useful for the WSD task, we extended them using GPT-3.5. We automatically filtered out sentences that contained either of two errors: 1. The original dictionary lemma was not present in the full sentence. Although we prompted GPT-3.5 to generate complete sentences by extending existing examples, it sometimes omitted the original lemma. 2. The generated sentence was identical to one of the already generated sentences. The sentences generated by GPT-3.5 are not guaranteed to be unique; therefore, we discarded duplicates. |
dc.language.iso | slv |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | WSD |
dc.subject | word sense disambiguation |
dc.subject | GPT |
dc.title | Word-sense disambiguation corpus SloDicWSD 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tadej Škvorc tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | European Union C3.K8.IB PoVeJMo: Adaptive Natural Language Processing with Large Language Models - Co-financing of research innovation projects in support of green transition and digitalisation nationalFund |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds |
sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
sponsor | ARIS (Slovenian Research and Innovation Agency) V5-2297 Media Landscape in Slovenia between Pluralisation and Homogenisation nationalFunds |
size.info | 758 entries |
size.info | 110685 sentences |
size.info | 2029862 words |
files.count | 1 |
files.size | 3153018 |
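
The two automatic filters named in dc.description (lemma missing from the generated sentence, and duplicate sentences) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_generated`, the `(lemma, sentence)` input shape, and the case-insensitive substring test for "lemma present" are all assumptions, since the record does not say how lemma presence was checked.

```python
def filter_generated(records):
    """Keep (lemma, sentence) pairs that pass both filters.

    Assumption: 'lemma present' is approximated by a case-insensitive
    substring match; the record does not describe the actual matching rule.
    """
    seen = set()   # sentences already kept, used for duplicate detection
    kept = []
    for lemma, sentence in records:
        # Filter 1: drop sentences that do not contain the lemma.
        if lemma.lower() not in sentence.lower():
            continue
        # Filter 2: drop sentences identical to an earlier kept sentence.
        if sentence in seen:
            continue
        seen.add(sentence)
        kept.append((lemma, sentence))
    return kept
```

Because Slovene is highly inflected, a plain substring test like this one also rejects sentences in which the lemma appears only in an inflected form (for example "mizi" for the lemma "miza"), so it should be read as one strict interpretation of the first filter.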
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)



- Name: SloDictWSD.zip
- Size: 3.01 MB
- Format: application/zip
- Description: Data in JSON format
- MD5: 22bee7c77d4cd164b55912c5c22c8d21
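
A downloaded copy of the archive can be checked against the MD5 value listed above. The sketch below uses only Python's standard `hashlib`; the helper names `md5_of_file` and `verify`, and the local path in the usage note, are illustrative assumptions rather than part of the record.

```python
import hashlib

# MD5 checksum as listed in this record for SloDictWSD.zip.
EXPECTED_MD5 = "22bee7c77d4cd164b55912c5c22c8d21"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify("SloDictWSD.zip")` would return `True` for an intact download, assuming the file was saved under that name in the working directory.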