dc.contributor.author Knez, Timotej
dc.contributor.author Žitnik, Slavko
dc.date.accessioned 2026-04-14T13:07:06Z
dc.date.available 2026-04-14T13:07:06Z
dc.date.issued 2026-04-14
dc.identifier.uri http://hdl.handle.net/11356/2116
dc.description The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants.

The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios.

Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon.
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/llm4dh/
dc.subject question answering
dc.subject Slovene grammar
dc.subject large language models
dc.title Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Timotej Knez timotej.knez@fri.uni-lj.si UL FRI
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor European Union HORIZON-WIDERA-2023-TALENTS-01-01 101186647 EU Era Chair (AI4DH) euFunds
size.info 16508 items
files.count 2
files.size 53994610


Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: lexical_qa_train.json
Size: 46.32 MB
Format: Unknown
Description: Train set
MD5: 1e3b95816477e4471c23743f97508eba

Name: lexical_qa_test.json
Size: 5.17 MB
Format: Unknown
Description: Test set
MD5: bcc908e52cfa5ccf3ea9c611017b0d49
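
The MD5 values listed for the two files can be used to check that a download arrived intact. The following is a minimal sketch, not part of the record itself: it assumes both files were saved under their listed names in the current directory, and the helper names `md5_of_file` and `verify_downloads` are hypothetical, introduced only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "lexical_qa_train.json": "1e3b95816477e4471c23743f97508eba",
    "lexical_qa_test.json": "bcc908e52cfa5ccf3ea9c611017b0d49",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, outcome in verify_downloads().items():
        print(f"{name}: {labels[outcome]}")
```

Reading in chunks keeps memory use flat even for the larger training file. A mismatch most likely indicates a truncated or corrupted download, in which case the file should be fetched again.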
