| dc.contributor.author | Knez, Timotej |
| dc.contributor.author | Žitnik, Slavko |
| dc.date.accessioned | 2026-04-14T13:07:06Z |
| dc.date.available | 2026-04-14T13:07:06Z |
| dc.date.issued | 2026-04-14 |
| dc.identifier.uri | http://hdl.handle.net/11356/2116 |
| dc.description | The Slovene Lexicographic QA Fine-Tuning Corpus is a specialized dataset designed to advance the performance of AI models in understanding the structural, grammatical, and semantic nuances of the Slovene language. Comprising over 16,000 question-answer pairs, the corpus shifts away from general knowledge to focus on high-quality lexicographic data, including morphology, lemmatization, and part-of-speech identification. It serves as a critical resource for fine-tuning models to act as sophisticated linguistic assistants. The dataset integrates diverse sources, ranging from automatically generated content based on the Digital Dictionary Database of Slovene (DDDS) to manual expert advice from the Jezikovna svetovalnica portal. This hybrid approach ensures a robust mix of systematic grammatical queries and nuanced, real-world linguistic explanations. With a significant portion of the data derived from annotated linguistic corpora like SSJ500k, the dataset provides a reliable foundation for training models in both context-free definitions and context-dependent usage scenarios. Technically, the corpus is structured for high utility in machine learning workflows, featuring a 90/10 training and test split with metadata for each entry. It categorizes questions into specific types such as definitions and usage examples, allowing researchers to perform targeted domain adaptation. By providing clear links between questions and specific lexemes, the corpus enables precise evaluation of a model's ability to navigate the formal rules and practical applications of the Slovene lexicon. |
| dc.language.iso | slv |
| dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://www.cjvt.si/llm4dh/ |
| dc.subject | question answering |
| dc.subject | Slovene grammar |
| dc.subject | large language models |
| dc.title | Slovene Lexicographic QA Fine-Tuning Corpus SloLexQA 1.0 |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Timotej Knez; timotej.knez@fri.uni-lj.si; UL FRI |
| sponsor | ARIS (Slovenian Research and Innovation Agency); GC-0002; LLM4DH: Large Language Models for Digital Humanities; nationalFunds |
| sponsor | European Union; HORIZON-WIDERA-2023-TALENTS-01-01; 101186647; EU Era Chair (AI4DH); euFunds |
| size.info | 16508 items |
| files.count | 2 |
| files.size | 53994610 |
Files in this item
Total download size: 51.49 MB. This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
| Name | Size | Format | Description | MD5 |
| lexical_qa_train.json | 46.32 MB | Unknown | Train set | 1e3b95816477e4471c23743f97508eba |
| lexical_qa_test.json | 5.17 MB | Unknown | Test set | bcc908e52cfa5ccf3ea9c611017b0d49 |
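
The MD5 values listed for the two files can be used to check that a download is complete and uncorrupted. The following is a minimal sketch, assuming both files have been saved under their listed names in a single local directory. The file names and digests come from the listing above, while the helper names `md5_of_file` and `verify_downloads` are hypothetical and introduced only for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 digests as listed in this record.
EXPECTED_MD5 = {
    "lexical_qa_train.json": "1e3b95816477e4471c23743f97508eba",
    "lexical_qa_test.json": "bcc908e52cfa5ccf3ea9c611017b0d49",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the files were downloaded into the current directory.
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use flat for the larger training file. A `False` result means only that the file is absent or differs from the listed digest; it does not say which.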