dc.contributor.author Vreš, Domen
dc.contributor.author Arčon, Tjaša
dc.contributor.author Čibej, Jaka
dc.contributor.author Robnik-Šikonja, Marko
dc.contributor.author Krek, Simon
dc.contributor.author Gabrovšek, Dejan
dc.contributor.author Ježovnik, Janoš
dc.contributor.author Kastelic, Maja
dc.contributor.author Krvina, Domen
dc.contributor.author Ledinek, Nina
dc.contributor.author Michelizza, Mija
dc.contributor.author Perdih, Andrej
dc.contributor.author Petric Žižić, Špela
dc.contributor.author Trojar, Mitja
dc.date.accessioned 2024-09-30T13:16:46Z
dc.date.available 2024-09-30T13:16:46Z
dc.date.issued 2024-09-25
dc.identifier.uri http://hdl.handle.net/11356/1971
dc.description GaMS-Instruct-GEN is an instruction-following dataset for fine-tuning Slovene large language models to follow instructions. It consists of prompt-response pairs, some of which contain an additional input field. The dataset was generated automatically with GPT-4, starting from 225 manually compiled seed prompts from SelfInstruct (Wang et al. 2022), an instruction-following dataset for English (https://huggingface.co/datasets/yizhongw/self_instruct). The seed prompts were manually translated into Slovene (see "seed_tasks_sl.jsonl") and used as part of a prompt to generate additional similar examples (see 00README.txt for more details). The automatically generated examples were manually validated by 9 annotators (linguists). Version 1.0 contains only prompt-response pairs that are adequately formatted and free of LLM hallucinations. Most of the prompt-response pairs deal with general topics (e.g. essay writing, event organization, text correction, creative tasks), while some deal with Slovene-specific topics (e.g. planning trips around Slovenia, prompts referring to Slovene literature or culture).
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/povejmo/en/project/
dc.subject large language models
dc.subject instruction following dataset
dc.title Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Domen Vreš domen.vres@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person Jaka Čibej jaka.cibej@ff.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds
size.info 6832 units
files.count 1
files.size 3268682
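
The description above states that the dataset consists of prompt-response pairs, some of which carry an additional input field. The following is a minimal sketch of how such records might be separated once loaded, assuming a JSON Lines layout. The key names `instruction`, `input`, and `output` are hypothetical and are not confirmed by this record; the actual keys and file layout are documented in 00README.txt. The sample records are placeholders, not data from the dataset.

```python
import json

# Hypothetical field names (assumption); check 00README.txt for the real keys.
PROMPT_KEY, INPUT_KEY, RESPONSE_KEY = "instruction", "input", "output"


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_by_input(records, input_key=INPUT_KEY):
    """Separate records with a non-empty input field from those without one."""
    def has_input(record):
        return bool(str(record.get(input_key) or "").strip())

    with_input = [r for r in records if has_input(r)]
    without_input = [r for r in records if not has_input(r)]
    return with_input, without_input


# Placeholder records for illustration only; not taken from the dataset.
sample = "\n".join([
    json.dumps({PROMPT_KEY: "Placeholder prompt A", INPUT_KEY: "",
                RESPONSE_KEY: "Placeholder response A"}),
    json.dumps({PROMPT_KEY: "Placeholder prompt B", INPUT_KEY: "Placeholder input B",
                RESPONSE_KEY: "Placeholder response B"}),
])

with_input, without_input = split_by_input(parse_jsonl(sample))
print(len(with_input), "with input;", len(without_input), "without input")
```

Treating a missing key, a null value, and an empty string alike as "no input" is a design choice of this sketch; the dataset itself may use only one of these conventions.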


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name GaMS-Instruct-GEN_1.0.zip
Size 3.12 MB
Format application/zip
Description GaMS-Instruct-GEN 1.0 (JSON)
MD5 1def8ed37160b023b4f69637477591ed
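
The MD5 value listed for the archive can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module; the local path `GaMS-Instruct-GEN_1.0.zip` is an assumption about where the file was saved, and the helper names `md5_of_file` and `matches_record` are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record.
EXPECTED_MD5 = "1def8ed37160b023b4f69637477591ed"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("GaMS-Instruct-GEN_1.0.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
    else:
        print("archive not found at", archive)
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.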