Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0

Name: Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0
License: https://creativecommons.org/licenses/by/4.0/

Vreš, Domen; Arčon, Tjaša; Čibej, Jaka; Robnik-Šikonja, Marko; Krek, Simon; Gabrovšek, Dejan; Ježovnik, Janoš; Kastelic, Maja; Krvina, Domen; Ledinek, Nina; Michelizza, Mija; Perdih, Andrej; Petric Žižić, Špela; Trojar, Mitja

dc.contributor.author	Vreš, Domen
dc.contributor.author	Arčon, Tjaša
dc.contributor.author	Čibej, Jaka
dc.contributor.author	Robnik-Šikonja, Marko
dc.contributor.author	Krek, Simon
dc.contributor.author	Gabrovšek, Dejan
dc.contributor.author	Ježovnik, Janoš
dc.contributor.author	Kastelic, Maja
dc.contributor.author	Krvina, Domen
dc.contributor.author	Ledinek, Nina
dc.contributor.author	Michelizza, Mija
dc.contributor.author	Perdih, Andrej
dc.contributor.author	Petric Žižić, Špela
dc.contributor.author	Trojar, Mitja
dc.date.accessioned	2024-09-30T13:16:46Z
dc.date.available	2024-09-30T13:16:46Z
dc.date.issued	2024-09-25
dc.identifier.uri	http://hdl.handle.net/11356/1971
dc.description	GaMS-Instruct-GEN is an instruction-following dataset for fine-tuning Slovene large language models. It consists of prompt-response pairs, some of which contain an additional input field. The dataset was generated automatically with GPT-4 from 225 manually compiled seed prompts from SelfInstruct (Wang et al. 2022), an instruction-following dataset for English (https://huggingface.co/datasets/yizhongw/self_instruct). The seed prompts were manually translated into Slovene (see "seed_tasks_sl.jsonl") and used as part of a prompt to generate additional similar examples (see 00README.txt for more details). The automatically generated examples were manually validated by 9 annotators (linguists). Version 1.0 contains only prompt-response pairs that are adequately formatted and free of LLM hallucinations. Most of the prompt-response pairs deal with general topics (e.g. essay writing, event organization, text corrections, creative tasks), while some deal with Slovene-specific topics (e.g. planning trips around Slovenia, prompts referring to Slovene literature or culture).
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/povejmo/en/project/
dc.subject	large language models
dc.subject	instruction following dataset
dc.title	Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Domen Vreš domen.vres@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person	Jaka Čibej jaka.cibej@ff.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds
size.info	6832 units
files.count	1
files.size	3268682
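
The description says the dataset consists of prompt-response pairs, some with an additional input field, and mentions a JSON Lines seed file. The following is a minimal sketch of reading such records and joining them into training text. The field names "instruction", "input" and "output" and the two sample records are assumptions made up for illustration, not taken from the dataset; 00README.txt documents the actual schema.

```python
import json
from typing import Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_training_text(example: dict) -> str:
    """Join prompt, optional input, and response with blank lines."""
    # Hypothetical field names; the real ones may differ.
    parts = [example["instruction"]]
    if example.get("input"):  # the input field is optional
        parts.append(example["input"])
    parts.append(example["output"])
    return "\n\n".join(parts)


# Made-up sample rows standing in for lines read from a .jsonl file.
sample = [
    '{"instruction": "Translate to English.", "input": "Dober dan", "output": "Good day"}',
    '{"instruction": "Name the capital of Slovenia.", "output": "Ljubljana"}',
]
texts = [to_training_text(e) for e in iter_examples(sample)]
```

To read from disk instead, pass an open file handle (opened with encoding="utf-8") to iter_examples in place of the sample list.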