Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0

Name: Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
License: https://creativecommons.org/licenses/by/4.0/

Knez, Timotej; Žitnik, Slavko


dc.contributor.author	Knez, Timotej
dc.contributor.author	Žitnik, Slavko
dc.date.accessioned	2026-04-14T13:07:06Z
dc.date.available	2026-04-14T13:07:06Z
dc.date.issued	2026-04-14
dc.identifier.uri	http://hdl.handle.net/11356/2116
dc.description	The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset for improving how AI models handle the structural, grammatical, and semantic properties of Slovene. It comprises over 16,000 question-answer pairs and focuses on lexicographic data rather than general knowledge, covering morphology, lemmatization, and part-of-speech identification. It is intended for fine-tuning models to act as linguistic assistants. The dataset combines several sources, from content generated automatically from the Digital Dictionary Database of Slovene (DDDS) to manually written expert advice from the Jezikovna svetovalnica portal, so it mixes systematic grammatical queries with real-world linguistic explanations. A significant portion of the data is derived from annotated linguistic corpora such as SSJ500k, which supports training on both context-free definitions and context-dependent usage. The corpus is split 90/10 into training and test sets, with metadata for each entry. Questions are categorized into types such as definitions and usage examples, which allows targeted domain adaptation. Each question is linked to a specific lexeme, which enables precise evaluation of how well a model handles the formal rules and practical use of the Slovene lexicon.
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/llm4dh/
dc.subject	question answering
dc.subject	Slovene grammar
dc.subject	large language models
dc.title	Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Timotej Knez timotej.knez@fri.uni-lj.si UL FRI
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	European Union HORIZON-WIDERA-2023-TALENTS-01-01 101186647 EU Era Chair (AI4DH) euFunds
size.info	16508 items
files.count	2
files.size	53994610
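
The description mentions a 90/10 training and test split, question types, and links from questions to lexemes. A minimal sketch of how such entries might be inspected follows. The record layout here is hypothetical: the field names `split`, `type`, `lexeme`, `question`, and `answer`, the type labels, and the placeholder values are assumptions for illustration, not the documented schema of the distributed files.

```python
from collections import Counter

# Hypothetical in-memory records standing in for corpus entries.
# Field names, type labels, and values are placeholders, not the real schema.
records = [
    {
        "id": i,
        "split": "test" if i == 9 else "train",
        "type": "definition" if i % 2 == 0 else "usage_example",
        "lexeme": f"lexeme_{i}",
        "question": f"question {i}",
        "answer": f"answer {i}",
    }
    for i in range(10)
]


def split_ratio(entries):
    """Return the share of entries in each split, e.g. train vs. test."""
    counts = Counter(entry["split"] for entry in entries)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def filter_by_type(entries, question_type):
    """Keep only entries of one question type, for targeted adaptation."""
    return [entry for entry in entries if entry["type"] == question_type]


def group_by_lexeme(entries):
    """Map each lexeme to the entries that refer to it."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry["lexeme"], []).append(entry)
    return groups


ratios = split_ratio(records)
definitions = filter_by_type(records, "definition")
by_lexeme = group_by_lexeme(records)
```

With the real files, the same functions could be applied once the entries are loaded into a list of dictionaries and the keys are adjusted to the actual field names.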