dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Milosavljević, Stefan |
dc.contributor.author | Galant, Nada |
dc.contributor.author | Benčina, Sonja |
dc.contributor.author | Čibej, Jaka |
dc.date.accessioned | 2024-04-24T16:05:16Z |
dc.date.available | 2024-04-24T16:05:16Z |
dc.date.issued | 2024-04-26 |
dc.identifier.uri | http://hdl.handle.net/11356/1766 |
dc.description | The DIALECT-COPA datasets comprise Choice of Plausible Alternatives (COPA) datasets for three South Slavic dialects: (1) COPA-SL-CER for the Cerkno dialect of Slovenian, spoken in the Slovenian Littoral region, specifically from the town of Idrija; (2) COPA-HR-CKM for the Chakavian dialect of Croatian from the northern Adriatic, specifically from the town of Žminj; (3) COPA-SR-TOR for the Torlak dialect from southeastern Serbia, specifically from the town of Lebane. The datasets were translated from the English COPA dataset (https://people.ict.usc.edu/~gordon/copa.html) by native dialect speakers, following the XCOPA dataset translation methodology (https://arxiv.org/abs/2005.00333). A novelty in the DIALECT-COPA translation approach is that both the English text and the version in the corresponding standard South Slavic language were available to the translator during the translation process. Each instance consists of a premise (My body cast a shadow over the grass), a question (What is the cause? / What happened as a result?), and two choices (The sun was rising; The grass was cut), with a label encoding which of the choices is more plausible according to the annotator or translator (The sun was rising). The datasets follow the same format as the Croatian COPA-HR dataset (http://hdl.handle.net/11356/1404), the Macedonian COPA-MK dataset (http://hdl.handle.net/11356/1687) and the Serbian COPA-SR dataset (http://hdl.handle.net/11356/1708). Each dataset is split into training (400 instances) and validation (100 instances) JSONL files. The test split (500 instances), which is usually part of the COPA datasets, has been withheld and can be shared upon request. The reason is to prevent the test instances from being included in the training data of future large language models, which would invalidate the benchmark measurements.
The DIALECT-COPA datasets are published as part of the DIALECT-COPA shared task at the VarDial 2024 workshop, where they were used as gold data for evaluating the performance of large language models on South Slavic dialects (https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa). |
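The instance structure described above (premise, question, two choices, label) could be read from a JSONL file roughly as follows. This is a minimal sketch, not part of the record: the field names `premise`, `choice1`, `choice2`, `question`, `label` and `idx`, and the convention that label 0 points to the first choice, are assumptions based on the common SuperGLUE-style COPA layout and should be checked against the actual files. The sample line is built from the English example in the description, not taken from the dialect data.

```python
import json
from typing import Iterable

# ASSUMPTION: field names follow the SuperGLUE-style COPA layout
# ("premise", "choice1", "choice2", "question", "label", "idx").
# Verify them against the downloaded JSONL files before relying on this.


def parse_copa_lines(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def plausible_choice(instance: dict) -> str:
    """Return the text of the choice the label points to.

    ASSUMPTION: label 0 selects choice1 and label 1 selects choice2.
    """
    return instance["choice1"] if instance["label"] == 0 else instance["choice2"]


# Illustrative line built from the English example in the description.
sample_line = (
    '{"premise": "My body cast a shadow over the grass.", '
    '"choice1": "The sun was rising.", "choice2": "The grass was cut.", '
    '"question": "cause", "label": 0, "idx": 1}'
)

instances = parse_copa_lines([sample_line])
print(plausible_choice(instances[0]))
```

For a real file, the same function can be applied to an open file handle, for example `parse_copa_lines(open("copa-sl-cer.val.jsonl", encoding="utf-8"))`.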
dc.language.iso | slv |
dc.language.iso | hrv |
dc.language.iso | srp |
dc.language.iso | ckm |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://aclanthology.org/2024.vardial-1.7/ |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://sites.google.com/view/vardial-2024/shared-tasks/dialect-copa |
dc.subject | commonsense reasoning |
dc.subject | manual translation |
dc.title | "Choice of plausible alternatives" datasets in South Slavic dialects DIALECT-COPA |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 1500 items |
size.info | 6 files |
size.info | 279 kb |
files.count | 6 |
files.size | 286402 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name | Size | Format | Description | MD5 |
copa-sl-cer.train.jsonl | 72.64 KB | Unknown | Training split of COPA-SL-CER | e7f31b5a4c4f1677d67e77132535a5dc |
copa-sl-cer.val.jsonl | 18.45 KB | Unknown | Validation split of COPA-SL-CER | 09bbfe4d5dce661e748bb798e943f032 |
copa-hr-ckm.train.jsonl | 76.68 KB | Unknown | Training split of COPA-HR-CKM | 5c84b2efc1fe43321a8e13013a1279a7 |
copa-hr-ckm.val.jsonl | 19.4 KB | Unknown | Validation split of COPA-HR-CKM | c5a20701d803fee42cba90c7878fc72b |
copa-sr-tor.train.jsonl | 74.18 KB | Unknown | Training split of COPA-SR-TOR | 630aca10135c12d668a68ab1e62375b4 |
copa-sr-tor.val.jsonl | 18.34 KB | Unknown | Validation split of COPA-SR-TOR | 9ef13b5ab79bd9e2965f16147fea1816 |
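
The MD5 checksums listed for the six files can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_directory` are hypothetical, and the sketch assumes the files were saved under their listed names in a single local directory.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "copa-sl-cer.train.jsonl": "e7f31b5a4c4f1677d67e77132535a5dc",
    "copa-sl-cer.val.jsonl": "09bbfe4d5dce661e748bb798e943f032",
    "copa-hr-ckm.train.jsonl": "5c84b2efc1fe43321a8e13013a1279a7",
    "copa-hr-ckm.val.jsonl": "c5a20701d803fee42cba90c7878fc72b",
    "copa-sr-tor.train.jsonl": "630aca10135c12d668a68ab1e62375b4",
    "copa-sr-tor.val.jsonl": "9ef13b5ab79bd9e2965f16147fea1816",
}


def md5_of_file(path, chunk_size: int = 65536) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory) -> dict[str, bool]:
    """Map each listed file name to True if it exists and its MD5 matches.

    Hypothetical helper: assumes the files keep their listed names
    inside `directory`. Missing files are reported as False.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_directory("path/to/download")` returns one boolean per file. A `False` entry means the file is missing or differs from the published version.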