
 
dc.contributor.author Knez, Timotej
dc.contributor.author Žitnik, Slavko
dc.date.accessioned 2026-04-14T13:07:06Z
dc.date.available 2026-04-14T13:07:06Z
dc.date.issued 2026-04-14
dc.identifier.uri http://hdl.handle.net/11356/2116
dc.description The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants. The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios. Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon.
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/
dc.subject question answering
dc.subject Slovene grammar
dc.subject large language models
dc.title Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Timotej Knez timotej.knez@fri.uni-lj.si UL FRI
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor European Union HORIZON-WIDERA-2023-TALENTS-01-01 101186647 EU Era Chair (AI4DH) euFunds
size.info 16508 items
files.count 2
files.size 53994610
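
The description above mentions a 90/10 training and test split, and size.info reports 16508 items in total. A minimal sketch for checking both figures after download is shown here. It assumes, without confirmation from this record, that each file holds a top-level JSON array with one element per question-answer pair; the helper names `load_entries` and `split_shares` are hypothetical.

```python
import json


def load_entries(path):
    """Load one corpus file.

    Assumption (not confirmed by the record): the file is a top-level
    JSON array with one element per question-answer pair.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"{path}: expected a JSON array, got {type(data).__name__}")
    return data


def split_shares(train_count, test_count):
    """Return the (train, test) shares of the combined entry count."""
    total = train_count + test_count
    if total == 0:
        raise ValueError("both splits are empty")
    return train_count / total, test_count / total
```

If the assumption holds, calling `split_shares(len(load_entries("lexical_qa_train.json")), len(load_entries("lexical_qa_test.json")))` should return shares close to 0.9 and 0.1, and the two counts should sum to 16508.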


 Files in this item

 Download all files in this item (51.49 MB)
This item is Publicly Available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: lexical_qa_train.json
Size: 46.32 MB
Format: Unknown
Description: Train set
MD5: 1e3b95816477e4471c23743f97508eba
Name: lexical_qa_test.json
Size: 5.17 MB
Format: Unknown
Description: Test set
MD5: bcc908e52cfa5ccf3ea9c611017b0d49
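
The MD5 values listed for the two files can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the checksums are copied from this record, while the function names `md5_of_file` and `verify_downloads` and the assumption that both files sit in one local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "lexical_qa_train.json": "1e3b95816477e4471c23743f97508eba",
    "lexical_qa_test.json": "bcc908e52cfa5ccf3ea9c611017b0d49",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True or False (checksum match),
    or to None when the file is not present in the directory."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results
```

Calling `verify_downloads("path/to/download/dir")` returns one entry per file; a value of `False` suggests a corrupted or incomplete download. MD5 is suitable here only as an integrity check against accidental corruption, not as protection against deliberate tampering.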
