dc.contributor.author | Vreš, Domen |
dc.contributor.author | Arčon, Tjaša |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Gabrovšek, Dejan |
dc.contributor.author | Ježovnik, Janoš |
dc.contributor.author | Kastelic, Maja |
dc.contributor.author | Krvina, Domen |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Michelizza, Mija |
dc.contributor.author | Perdih, Andrej |
dc.contributor.author | Petric Žižić, Špela |
dc.contributor.author | Trojar, Mitja |
dc.date.accessioned | 2024-09-30T13:16:46Z |
dc.date.available | 2024-09-30T13:16:46Z |
dc.date.issued | 2024-09-25 |
dc.identifier.uri | http://hdl.handle.net/11356/1971 |
dc.description | GaMS-Instruct-GEN is a dataset for fine-tuning Slovene large language models to follow instructions. It consists of prompt-response pairs, some of which contain an additional input field. The dataset was generated automatically with GPT-4 from 225 manually compiled seed prompts from SelfInstruct (Wang et al. 2022), an instruction-following dataset for English (https://huggingface.co/datasets/yizhongw/self_instruct). The seed prompts were manually translated into Slovene (see "seed_tasks_sl.jsonl") and used as part of a prompt to generate additional similar examples (see 00README.txt for more details). The automatically generated examples were manually validated by 9 annotators (linguists). Version 1.0 contains only prompt-response pairs that are adequately formatted and free of LLM hallucinations. Most of the prompt-response pairs deal with general topics (e.g. essay writing, event organization, text correction, creative tasks), while some deal with Slovene-specific topics (e.g. planning trips around Slovenia, prompts referring to Slovene literature or culture). |
dc.language.iso | slv |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/povejmo/en/project/ |
dc.subject | large language models |
dc.subject | instruction following dataset |
dc.title | Slovene instruction-following dataset for large language models GaMS-Instruct-GEN 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Domen Vreš domen.vres@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
contact.person | Jaka Čibej jaka.cibej@ff.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARIS (Slovenian Research and Innovation Agency) NOO PoVeJMo research project (Adaptive Natural Language Processing with Large Language Models) nationalFunds |
size.info | 6832 units |
files.count | 1 |
files.size | 3268682 |
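
The description says that each record is a prompt-response pair, optionally with an additional input field. A minimal sketch of how such a record might be turned into a single prompt string follows. The key names `instruction` and `input` are assumptions borrowed from common instruction-dataset conventions, not confirmed by this record; check 00README.txt for the actual field names before use.

```python
def build_prompt(record, instruction_key="instruction", input_key="input"):
    """Join the instruction with its optional input field.

    The key names are hypothetical defaults; pass the real ones
    documented in 00README.txt.
    """
    instruction = record[instruction_key].strip()
    # The input field may be absent, empty, or null for many records.
    extra = (record.get(input_key) or "").strip()
    if extra:
        return f"{instruction}\n\n{extra}"
    return instruction


if __name__ == "__main__":
    # Illustrative record, not taken from the dataset.
    example = {"instruction": "Prevedi v angleščino.", "input": "Dober dan"}
    print(build_prompt(example))
```

Passing the key names as parameters keeps the helper usable even if the actual schema differs from the assumed one.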
Files in this item
- Name: GaMS-Instruct-GEN_1.0.zip
- Size: 3.12 MB
- Format: application/zip
- Description: GaMS-Instruct-GEN 1.0 (JSON)
- MD5: 1def8ed37160b023b4f69637477591ed
- Archive contents: GaMS-Instruct-GEN_1.0
  - GaMS-Instruct-GEN_1.0.json (9 MB)
  - 00README.txt (8 kB)
  - seed_tasks_sl.jsonl (135 kB)
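
The MD5 value listed for the archive can be used to check a download for corruption. A minimal sketch, assuming the archive was saved under its listed name in the current directory (the local path is an assumption; the expected digest is copied from the file listing above):

```python
import hashlib
from pathlib import Path

# Digest as published in the file listing of this record.
EXPECTED_MD5 = "1def8ed37160b023b4f69637477591ed"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("GaMS-Instruct-GEN_1.0.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
```

MD5 is adequate here for detecting transfer errors; it is not a guarantee against deliberate tampering.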