Slovene instruction-following dataset for large language models GaMS-Instruct-PHARMA 1.0

License: https://creativecommons.org/licenses/by/4.0/

Plesnik, Emil; Morić, Ariana; Tovornik, Robert; Fabjan, Borut


dc.contributor.author	Plesnik, Emil
dc.contributor.author	Morić, Ariana
dc.contributor.author	Tovornik, Robert
dc.contributor.author	Fabjan, Borut
dc.date.accessioned	2026-02-10T16:00:13Z
dc.date.available	2026-02-10T16:00:13Z
dc.date.issued	2026-02-03
dc.identifier.uri	http://hdl.handle.net/11356/2081
dc.description	GaMS-Instruct-PHARMA is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain, particularly in the domain of pharmaceutical drugs and their effects. The dataset is based on official Slovene pharmaceutical databases that are publicly accessible on the websites of the Slovenian Database of Medicinal Products (Centralna baza zdravil, https://www.cbz.si) and the Agency for Medicinal Products and Medical Devices of the Republic of Slovenia (Javna agencija Republike Slovenije za zdravila in medicinske pripomočke; JAZMP; https://www.jazmp.si). Version 1.0 contains 482,276 instructions (i.e. prompt-response pairs), which are useful in natural language processing, computational linguistics, and medical informatics. It can be used in research and development projects, for example for training and fine-tuning LLMs for the pharmaceutical domain, developing medical chatbots and assistants in Slovene, and supporting pharmaceutical and medical workers in searching for information on pharmaceutical drugs.
The dataset consists of two data files:
• JSON: GaMS-Instruct-PHARMA_1.0.json (235 MB) - formatted for inspection
• JSONL: GaMS-Instruct-PHARMA_1.0.jsonl (210 MB) - optimized for training models
Statistics on the dataset are provided in GaMS-Instruct-PHARMA_1.0_dataset_statistics.json. For more information, please consult 00README.txt and the accompanying documentation.
Please note that the current version of the dataset does not guarantee full clinical accuracy and may contain errors as a consequence of LLM hallucinations.
dc.language.iso	slv
dc.publisher	Better, d.o.o.
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/povejmo/
dc.subject	instruction following dataset
dc.subject	medical texts
dc.subject	large language models
dc.subject	pharmaceutical texts
dc.title	Slovene instruction-following dataset for large language models GaMS-Instruct-PHARMA 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Borut Fabjan info@better.care Better, d.o.o.
sponsor	ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	482276 units
files.count	1
files.size	49893207

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: GaMS-Instruct-PHARMA_1.0.zip
Size: 47.58 MB
Format: application/zip
Description: JSON + JSONL
MD5: 69613d11439e42c87bc4245a1b0a3761
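
The MD5 value listed for the archive can be used to check that a download is intact. The following is a minimal Python sketch, assuming the archive was saved under its original name in the current working directory (the local path is an assumption, not something this record specifies). The helper names `md5_of_file` and `verify_download` are illustrative only.

```python
import hashlib
from pathlib import Path

# MD5 digest as listed in this record for GaMS-Instruct-PHARMA_1.0.zip.
EXPECTED_MD5 = "69613d11439e42c87bc4245a1b0a3761"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local location of the downloaded archive.
    archive = Path("GaMS-Instruct-PHARMA_1.0.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download. MD5 serves here only as an integrity check, not as a security guarantee.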


File Preview

GaMS-Instruct-PHARMA_1.0
- GaMS-Instruct-PHARMA_1.0_docs.pdf (242 kB)
- GaMS-Instruct-PHARMA_1.0_dataset_statistics.json (787 B)
- GaMS-Instruct-PHARMA_1.0_docs.md (16 kB)
- GaMS-Instruct-PHARMA_1.0.jsonl (201 MB)
- GaMS-Instruct-PHARMA_1.0_docs.docx (36 kB)
- 00README.txt (14 kB)
- GaMS-Instruct-PHARMA_1.0.json (225 MB)
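
Since the JSONL file is the variant intended for training, a first sanity check after unpacking might be to stream it line by line and compare the record count with the 482,276 units stated in this record. The sketch below is a minimal Python example under two assumptions: the file follows the usual JSON Lines convention of one JSON value per line, and it sits in the current directory. The record does not list the field names, so the code only tallies whatever top-level keys it encounters; the actual schema is documented in 00README.txt. The function names `iter_jsonl` and `summarize` are illustrative.

```python
import json
from collections import Counter
from pathlib import Path

# Number of units stated in this record (size.info).
EXPECTED_RECORDS = 482_276


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def summarize(path):
    """Count records and tally top-level keys without assuming a schema."""
    count = 0
    keys = Counter()
    for record in iter_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record.keys())
    return count, keys


if __name__ == "__main__":
    # Assumed local location of the unpacked file.
    data_file = Path("GaMS-Instruct-PHARMA_1.0.jsonl")
    if data_file.exists():
        count, keys = summarize(data_file)
        print(f"records: {count} (expected {EXPECTED_RECORDS})")
        print("top-level keys:", dict(keys))
    else:
        print(f"{data_file} not found; unpack the archive first")
```

Streaming keeps memory use flat, which matters for a file of roughly 200 MB; loading the pretty-printed JSON variant with a single `json.load` call would hold the whole dataset in memory at once.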
