dc.contributor.author | Tovornik, Robert |
dc.contributor.author | Pavlović, Anđela |
dc.contributor.author | Radnić, Vuk |
dc.contributor.author | Plesnik, Emil |
dc.contributor.author | Fabjan, Borut |
dc.date.accessioned | 2025-09-02T15:05:47Z |
dc.date.available | 2025-09-02T15:05:47Z |
dc.date.issued | 2025-08-25 |
dc.identifier.uri | http://hdl.handle.net/11356/2045 |
dc.description | GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of prompts, instructions and responses from the field of medicine, particularly those pertaining to the use of pharmaceutical drugs and medications. The dataset was generated in several steps (for a more detailed description, please refer to 00README.txt). After consulting with experts from the medical field, a series of prompts was manually compiled containing questions relevant to drug and medication use. For each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were automatically generated using prompt tuning. In version 2.0, the dataset was extended with several similar English datasets that were translated into Slovene: MedQuAD, MeQSum, Medication QA, and LiveQA (references are available in 00README.txt). All translations were made automatically using GPT-4.1. Manual validation was carried out in two phases. In the first (preparation-evaluation) phase, the quality of machine translations was validated on a sample using different machine translation applications (DeepL, OpenAI) to determine the solution with optimal performance. In the second phase, a random sample of 20-40 examples from each translated subset was manually validated (a total of 240 examples). The manual validations were made by two experts from the field of medicine and an expert in dataset compilation. Unlike version 1.0, where the dataset consisted of prompt-response pairs, version 2.0 contains units consisting of three elements (instruction-input-output). The conversion was made using OpenAI GPT-4.1. All final instructions were manually validated by an expert in dataset compilation.
Two experts from the field of medicine participated in the design of clinically relevant categories of instructions, the compilation of examples of prompt-response pairs, and the manual validation of test results of the conversion process. Please note that the current version of the dataset (containing 25,046 instruction-input-output units) does not guarantee full clinical accuracy and may contain errors as a consequence of LLM hallucinations. |
dc.language.iso | slv |
dc.publisher | Better, d.o.o. |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.relation.replaces | http://hdl.handle.net/11356/1982 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/povejmo/en/project/ |
dc.subject | instruction following dataset |
dc.subject | medical texts |
dc.subject | large language models |
dc.title | Slovene instruction-following dataset for large language models GaMS-Instruct-MED 2.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Borut Fabjan info@better.care Better, d.o.o. |
sponsor | ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 25046 units |
files.count | 1 |
files.size | 43624205 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name | GaMS-Instruct-MED_2.0.zip |
Size | 41.6 MB |
Format | application/zip |
Description | GaMS-Instruct-MED 2.0 (JSON) |
MD5 | 7a555629f3736f6341671284eaad94bc |
- GaMS-Instruct-MED_2.0
  - GaMS-Instruct-MED_2.0.json (56 MB)
  - 00README.txt (7 kB)
  - GaMS-Instruct-MED_2.0_context.json (112 MB)
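
The MD5 checksum and byte size listed in this record can be used to check that a downloaded copy of the archive is intact. The following is a minimal Python sketch, not part of the original record: the function names `md5_of_file` and `verify_download` and the local path are illustrative assumptions, while the expected MD5 and size values are copied from the fields above.

```python
import hashlib
import os

# Values copied from this record (MD5 and files.size fields).
EXPECTED_MD5 = "7a555629f3736f6341671284eaad94bc"
EXPECTED_SIZE = 43624205  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the MD5 digest match."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Hypothetical local path: adjust to wherever the archive was saved.
    archive = "GaMS-Instruct-MED_2.0.zip"
    if os.path.exists(archive):
        print("OK" if verify_download(archive) else "MISMATCH")
    else:
        print(f"{archive} not found")
```

The size check runs first because it is cheap and catches truncated downloads before the full file is hashed. MD5 is suitable here only for detecting accidental corruption, not for protecting against deliberate tampering.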