Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0

Name: Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0
License: https://creativecommons.org/licenses/by/4.0/

Tovornik, Robert; Pavlović, Anđela; Plesnik, Emil; Fabjan, Borut

dc.contributor.author	Tovornik, Robert
dc.contributor.author	Pavlović, Anđela
dc.contributor.author	Plesnik, Emil
dc.contributor.author	Fabjan, Borut
dc.date.accessioned	2024-11-06T16:17:02Z
dc.date.available	2024-11-06T16:17:02Z
dc.date.issued	2024-09-25
dc.identifier.uri	http://hdl.handle.net/11356/1982
dc.description	GaMS-Instruct-MED is an instruction-following dataset for fine-tuning Slovene large language models to follow instructions in the medical domain. It consists of prompt-response pairs from the field of medicine, particularly concerning the use of pharmaceutical drugs and medications. The dataset was generated in several steps. First, in consultation with medical experts, a series of prompts was manually compiled, containing questions relevant to drug and medication use. Then, for each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were automatically generated using prompt tuning. The questions were based on the instructions for use of the medication in question. Inadequate questions were manually excluded, while the responses were generated entirely automatically using a specialized RAG system. Please note that the current version of the dataset (containing 18,897 prompt-response pairs) does not guarantee clinical accuracy and may contain errors caused by LLM hallucinations.
dc.language.iso	slv
dc.publisher	Better, d.o.o.
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.relation.isreplacedby	http://hdl.handle.net/11356/2045
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/povejmo/en/project/
dc.subject	instruction following dataset
dc.subject	medical texts
dc.subject	large language models
dc.title	Slovene instruction-following dataset for large language models GaMS-Instruct-MED 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
hidden	hidden
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Borut Fabjan; info@better.care; Better, d.o.o.
sponsor	ARIS (Slovenian Research and Innovation Agency); NOO; PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models); nationalFunds
sponsor	ARRS (Slovenian Research Agency); P6-0411; Language Resources and Technologies for Slovene; nationalFunds
size.info	18897 units
files.count	1
files.size	4801364