Word-sense disambiguation corpus SloDicWSD 1.0

Name: Word-sense disambiguation corpus SloDicWSD 1.0
License: https://creativecommons.org/licenses/by/4.0/

Škvorc, Tadej; Robnik-Šikonja, Marko

dc.contributor.author	Škvorc, Tadej
dc.contributor.author	Robnik-Šikonja, Marko
dc.date.accessioned	2025-01-27T17:13:19Z
dc.date.available	2025-01-27T17:13:19Z
dc.date.issued	2025-01-24
dc.identifier.uri	http://hdl.handle.net/11356/2008
dc.description	SloDicWSD is a Slovene word-sense disambiguation (WSD) corpus generated from data contained in SSKJ (Slovar slovenskega knjižnega jezika, the largest dictionary of standard Slovene). The corpus is an automatically constructed WSD dataset based on the sense inventory of the SSKJ dictionary and consists of SSKJ usage examples converted to complete sentences using GPT-3.5 Turbo (https://platform.openai.com/docs/models#gpt-3-5-turbo). We limited the corpus to the top 758 lemmas present in the Slovene part of the Elexis-WSD dataset (http://hdl.handle.net/11356/1842). For each lemma, we extracted every usage example from the SSKJ dictionary and labeled it with the matching sense. As these usage examples are likely too short to be useful for the WSD task, we extended them using GPT-3.5. We automatically filtered out generated sentences that contained either of two errors: 1. The original dictionary lemma was not present in the full sentence. Although we prompted GPT-3.5 to generate complete sentences by extending existing examples, it sometimes omitted the original lemma. 2. The generated sentence was identical to one of the already generated sentences. The sentences generated by GPT-3.5 are not guaranteed to be unique; therefore, we discarded duplicates.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	WSD
dc.subject	word sense disambiguation
dc.subject	GPT
dc.title	Word-sense disambiguation corpus SloDicWSD 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Tadej Škvorc tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	European Union C3.K8.IB PoVeJMo: Adaptive Natural Language Processing with Large Language Models - Co-financing of research innovation projects in support of green transition and digitalisation nationalFund
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) V5-2297 Media Landscape in Slovenia between Pluralisation and Homogenisation nationalFunds
size.info	758 entries
size.info	110685 sentences
size.info	2029862 words
files.count	1
files.size	3153018
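
The two automatic filters named in the description (a missing lemma and duplicate sentences) could be sketched roughly as follows. This is an illustration only, not the authors' code: the function names, the (lemma, sentence) input format, the token-level lemma check, and the choice to detect duplicates across the whole corpus rather than per lemma are all assumptions. In particular, a plain token match does not recognise inflected Slovene forms, so a real pipeline would likely pass in a lemmatizer.

```python
import re
from typing import Callable, Iterable


def simple_tokens(sentence: str) -> list[str]:
    """Hypothetical stand-in for a lemmatizer: lowercase word tokens only."""
    return re.findall(r"\w+", sentence.lower())


def filter_generated(
    pairs: Iterable[tuple[str, str]],
    lemmatize: Callable[[str], list[str]] = simple_tokens,
) -> list[tuple[str, str]]:
    """Keep (lemma, sentence) pairs that pass both filters.

    Filter 1: drop sentences in which the lemma does not occur.
    Filter 2: drop sentences identical to one already kept
              (assumed here to be checked across all lemmas).
    """
    seen: set[str] = set()
    kept: list[tuple[str, str]] = []
    for lemma, sentence in pairs:
        if lemma.lower() not in lemmatize(sentence):
            continue  # error 1: lemma missing from the generated sentence
        if sentence in seen:
            continue  # error 2: exact duplicate of an earlier sentence
        seen.add(sentence)
        kept.append((lemma, sentence))
    return kept


# Made-up example data, not taken from the corpus.
examples = [
    ("miza", "Miza stoji v kuhinji."),
    ("miza", "Stol stoji v kuhinji."),   # lemma missing
    ("miza", "Miza stoji v kuhinji."),   # duplicate
]
filtered = filter_generated(examples)
```

With the made-up data above, only the first pair survives. Swapping `simple_tokens` for a function that returns lemmas would let the first filter accept inflected forms as well.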