dc.contributor.author Škvorc, Tadej
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2025-01-27T17:13:19Z
dc.date.available 2025-01-27T17:13:19Z
dc.date.issued 2025-01-24
dc.identifier.uri http://hdl.handle.net/11356/2008
dc.description SloDicWSD is a Slovene word-sense disambiguation (WSD) corpus generated from data contained in SSKJ (Slovar slovenskega knjižnega jezika, the largest dictionary of standard Slovene). The corpus is an automatically constructed WSD dataset based on the sense inventory of the SSKJ dictionary and consists of SSKJ usage examples converted to complete sentences using GPT-3.5 Turbo (https://platform.openai.com/docs/models#gpt-3-5-turbo). We limited the corpus to the top 758 lemmas present in the Slovene part of the Elexis-WSD dataset (http://hdl.handle.net/11356/1842). For each lemma, we extracted every usage example from the SSKJ dictionary and labeled it with the matching sense. As these usage examples are likely too short to be useful for the WSD task, we extended them using GPT-3.5. We automatically filtered out sentences that contain either of the following two errors: 1. The original dictionary lemma was not present in the full sentence. Although we prompted GPT-3.5 to generate complete sentences by extending existing examples, it sometimes omitted the original lemma. 2. The generated sentence was identical to one of the already generated sentences. The sentences generated by GPT-3.5 are not guaranteed to be unique; therefore, we discarded duplicates.
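The two filtering rules in the description can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `filter_generated`, the `(target, sentence)` input shape, and the case-insensitive substring match are all assumptions made for illustration. The record does not say how lemma presence was checked, and a plain substring test would miss inflected Slovene forms.

```python
def filter_generated(pairs):
    """Keep (target, sentence) pairs that pass both filters.

    `pairs` is a hypothetical input: an iterable of (target, sentence)
    tuples, where `target` is the word expected to appear in the
    generated sentence.
    """
    seen = set()   # sentences already accepted, used for duplicate detection
    kept = []
    for target, sentence in pairs:
        # Filter 1: drop sentences that no longer contain the target word.
        # Assumption: a case-insensitive substring match is good enough.
        if target.casefold() not in sentence.casefold():
            continue
        # Filter 2: drop sentences identical to one generated earlier.
        key = sentence.strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append((target, sentence))
    return kept
```

Using a set for the already accepted sentences keeps the duplicate check at constant time per sentence, which is relevant at the scale of roughly 110,000 sentences reported for this corpus.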
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject WSD
dc.subject word sense disambiguation
dc.subject GPT
dc.title Word-sense disambiguation corpus SloDicWSD 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tadej Škvorc tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor European Union C3.K8.IB PoVeJMo: Adaptive Natural Language Processing with Large Language Models - Co-financing of research innovation projects in support of green transition and digitalisation nationalFund
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) L2-50070 Embeddings-based techniques for Media Monitoring Applications nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) V5-2297 Media Landscape in Slovenia between Pluralisation and Homogenisation nationalFunds
size.info 758 entries
size.info 110685 sentences
size.info 2029862 words
files.count 1
files.size 3153018


Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name
SloDictWSD.zip
Size
3.01 MB
Format
application/zip
Description
Data in JSON format
MD5
22bee7c77d4cd164b55912c5c22c8d21
File preview
    • SloDictWSD.json
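
The MD5 value listed for SloDictWSD.zip can be used to check a downloaded copy. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_expected` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed in this record for SloDictWSD.zip.
EXPECTED_MD5 = "22bee7c77d4cd164b55912c5c22c8d21"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("SloDictWSD.zip")` should return `True` for an intact download saved under that name in the current directory.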
