| dc.contributor.author | Kuzman Pungeršek, Taja |
| dc.contributor.author | Ljubešić, Nikola |
| dc.date.accessioned | 2026-02-19T20:46:59Z |
| dc.date.available | 2026-02-19T20:46:59Z |
| dc.date.issued | 2026-02-20 |
| dc.identifier.uri | http://hdl.handle.net/11356/2093 |
| dc.description | The multilingual training dataset for CAP policy topic classification, ParlaCAP-train, is a collection of parliamentary speeches in 29 European languages, automatically annotated with the 21 major policy topic labels from the Comparative Agendas Project (CAP) schema, along with an additional label "Other". The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against manually annotated test datasets in Bosnian, Croatian, English, and Serbian shows strong performance of the GPT-4o model, with macro-F1 scores ranging from 0.63 to 0.74. Moreover, inter-annotator agreement analysis on the test set indicates that the reliability of the LLM as an annotator is comparable to that of human annotators. The ParlaCAP-train dataset contains 35,579 parliamentary speeches, comprising a total of 5,341,147 words. The instances were drawn from all 29 national and regional parliamentary corpora included in the ParlaMint 4.1 collection (http://hdl.handle.net/11356/1912). For each parliament, a random sample of 1,200 speeches was selected. The dataset is divided into a training set (29,779 instances) and a development set (5,800 instances). The training set includes 1,000 speeches from each ParlaMint corpus, along with 779 additional Public Lands instances to improve the representation of this otherwise rare category. The development set contains 200 speeches per ParlaMint corpus. The splits are stratified by topic label.
The dataset uses the 21 major CAP policy topics (https://www.comparativeagendas.net/pages/master-codebook): "Agriculture", "Civil Rights", "Culture", "Defense", "Domestic Commerce", "Education", "Energy", "Environment", "Foreign Trade", "Government Operations", "Health", "Housing", "Immigration", "International Affairs", "Labor", "Law and Crime", "Macroeconomics", "Public Lands", "Social Welfare", "Technology", "Transportation", along with an additional category, "Other". Detailed label descriptions and information on their distribution in the dataset are provided in the accompanying README file. The ParlaCAP-train dataset is distributed in JSONL format. Each entry includes the speech text and the following metadata: document and speech ID, session date, source parliamentary corpus, speaker name and role, speech length, predicted CAP category, data split (train or dev), and an indicator specifying whether the instance was added during the augmentation step to increase the representation of the Public Lands label. Further details on the file format and metadata fields are provided in the README file. This dataset was used to develop the ParlaCAP policy topic classifier (https://www.doi.org/10.57967/hf/6684), created by fine-tuning the Transformer-based, BERT-like XLM-R-Parla model (https://huggingface.co/classla/xlm-r-parla). The classifier can be applied to any language covered by the XLM-RoBERTa pretraining corpus. |
| dc.language.iso | eus |
| dc.language.iso | bos |
| dc.language.iso | bul |
| dc.language.iso | cat |
| dc.language.iso | hrv |
| dc.language.iso | ces |
| dc.language.iso | dan |
| dc.language.iso | nld |
| dc.language.iso | eng |
| dc.language.iso | est |
| dc.language.iso | fin |
| dc.language.iso | fra |
| dc.language.iso | glg |
| dc.language.iso | deu |
| dc.language.iso | hun |
| dc.language.iso | isl |
| dc.language.iso | ita |
| dc.language.iso | lav |
| dc.language.iso | ell |
| dc.language.iso | nor |
| dc.language.iso | pol |
| dc.language.iso | por |
| dc.language.iso | rus |
| dc.language.iso | srp |
| dc.language.iso | slv |
| dc.language.iso | spa |
| dc.language.iso | swe |
| dc.language.iso | tur |
| dc.language.iso | ukr |
| dc.publisher | Jožef Stefan Institute |
| dc.relation.isreferencedby | https://doi.org/10.48550/arXiv.2602.16516 |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | https://clarinsi.github.io/parlacap/ |
| dc.subject | parliamentary debates |
| dc.subject | UK Parliament |
| dc.subject | Portuguese Parliament |
| dc.subject | Basque Parliament |
| dc.subject | Estonian Parliament |
| dc.subject | Spanish Parliament |
| dc.subject | Finnish Parliament |
| dc.subject | Ukrainian Parliament |
| dc.subject | Swedish Parliament |
| dc.subject | Serbian Parliament |
| dc.subject | Norwegian Parliament |
| dc.subject | Greek Parliament |
| dc.subject | Galician Parliament |
| dc.subject | Catalonian Parliament |
| dc.subject | Bosnian Parliament |
| dc.subject | Austrian Parliament |
| dc.subject | French Parliament |
| dc.subject | Slovenian Parliament |
| dc.subject | Polish Parliament |
| dc.subject | Croatian Parliament |
| dc.subject | Bulgarian Parliament |
| dc.subject | Latvian Parliament |
| dc.subject | Hungarian Parliament |
| dc.subject | Italian Parliament |
| dc.subject | Turkish Parliament |
| dc.subject | Dutch Parliament |
| dc.subject | Danish Parliament |
| dc.subject | Belgian Parliament |
| dc.subject | Icelandic Parliament |
| dc.subject | Czech Parliament |
| dc.subject | topic classification |
| dc.subject | topic |
| dc.subject | Comparative Agendas Project |
| dc.subject | CAP |
| dc.subject | CAP topic |
| dc.subject | policy topic |
| dc.subject | agenda-setting |
| dc.title | Multilingual training dataset for CAP policy topic classification ParlaCAP-train |
| dc.type | corpus |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
| sponsor | OSCARS project ParlaCap Comparing agenda settings across parliaments via the ParlaMint dataset Other |
| sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| sponsor | Institute of Contemporary History DARIAH DARIAH-SI nationalFunds |
| size.info | 35579 utterances |
| size.info | 5341147 words |
| files.count | 1 |
| files.size | 20465646 |
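
The description above states that the dataset is distributed as JSONL and that each entry carries a data split and a predicted CAP category. A minimal Python sketch for tallying instances per split and label might look as follows. The field names `split` and `label` are hypothetical placeholders, not taken from this record; the actual keys are documented in the README file shipped with the dataset.

```python
import json
from collections import Counter
from pathlib import Path


def count_by_split_and_label(path, split_field="split", label_field="label"):
    """Count instances per (split, label) pair in a JSONL file.

    The default field names are assumptions for illustration only;
    consult the dataset's README.md for the real metadata keys.
    """
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on them
            record = json.loads(line)  # one JSON object per line
            counts[(record[split_field], record[label_field])] += 1
    return counts
```

Called on the extracted file, for example `count_by_split_and_label("ParlaCAP-train.jsonl")`, and summed per split, the result can be compared with the split sizes stated in the description (29,779 training and 5,800 development instances), provided the field names are adjusted to match the README.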
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: ParlaCAP-train.zip
- Size: 19.52 MB
- Format: application/zip
- Description: The ParlaCAP-train.jsonl dataset and the accompanying README.md file
- MD5: c3000cda6677615d1056e31e31c8a05a
- ParlaCAP-train
- README.md (22 kB)
- ParlaCAP-train.jsonl (47 MB)
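
The MD5 value listed for ParlaCAP-train.zip can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library. The helper name `md5_of_file` and the local path in the usage note are illustrative assumptions; only the expected checksum comes from this record.

```python
import hashlib

# Checksum as listed in this record for ParlaCAP-train.zip.
EXPECTED_MD5 = "c3000cda6677615d1056e31e31c8a05a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved as `ParlaCAP-train.zip` in the working directory, the check is `md5_of_file("ParlaCAP-train.zip") == EXPECTED_MD5`. MD5 is adequate for detecting transfer errors but should not be relied on as a security guarantee.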