dc.contributor.author Kuzman Pungeršek, Taja
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2026-02-19T20:46:59Z
dc.date.available 2026-02-19T20:46:59Z
dc.date.issued 2026-02-20
dc.identifier.uri http://hdl.handle.net/11356/2093
dc.description ParlaCAP-train, the multilingual training dataset for CAP policy topic classification, is a collection of parliamentary speeches in 29 European languages, automatically annotated with 21 major policy topic labels from the Comparative Agendas Project (CAP) schema, along with an additional label "Other". The texts were annotated by the GPT-4o large language model, accessed via the OpenAI API (https://openai.com/index/hello-gpt-4o/). Evaluation against manually annotated test datasets in Bosnian, Croatian, English, and Serbian shows strong performance of the GPT-4o model, with macro-F1 scores ranging from 0.63 to 0.74. Moreover, inter-annotator agreement analysis on the test set indicates that the reliability of the LLM as an annotator is comparable to that of human annotators.
The ParlaCAP-train dataset contains 35,579 parliamentary speeches, comprising a total of 5,341,147 words. The instances were drawn from all 29 national and regional parliamentary corpora included in the ParlaMint 4.1 collection (http://hdl.handle.net/11356/1912). For each parliament, a random sample of 1,200 speeches was selected. The dataset is divided into a training set (29,779 instances) and a development set (5,800 instances). The training set includes 1,000 speeches from each ParlaMint corpus, along with 779 additional Public Lands instances to improve the representation of this otherwise rare category. The development set contains 200 speeches per ParlaMint corpus. The splits are stratified by topic label.
The dataset uses the 21 major CAP policy topics (https://www.comparativeagendas.net/pages/master-codebook): "Agriculture", "Civil Rights", "Culture", "Defense", "Domestic Commerce", "Education", "Energy", "Environment", "Foreign Trade", "Government Operations", "Health", "Housing", "Immigration", "International Affairs", "Labor", "Law and Crime", "Macroeconomics", "Public Lands", "Social Welfare", "Technology", "Transportation", along with an additional category, "Other".
Detailed label descriptions and information on their distribution in the dataset are provided in the accompanying README file. The ParlaCAP-train dataset is distributed in JSONL format. Each entry includes the speech text and the following metadata: document and speech ID, session date, source parliamentary corpus, speaker name and role, speech length, predicted CAP category, data split (train or dev), and an indicator specifying whether the instance was added during the augmentation step to increase the representation of the Public Lands label. Further details on the file format and metadata fields are provided in the README file.
This dataset was used to develop the ParlaCAP policy topic classifier (https://www.doi.org/10.57967/hf/6684), a fine-tuned version of XLM-R-Parla (https://huggingface.co/classla/xlm-r-parla), a Transformer-based, BERT-like model. The classifier can be applied to any language covered by the XLM-RoBERTa pretraining corpus.
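As a rough illustration of the split sizes and the JSONL layout described above, the following Python sketch recomputes the stated totals and counts records per split. The field names used here (`split`, `label`, `text`) and the sample records are assumptions made up for this example; the actual field names are documented in the README file.

```python
import io
import json
from collections import Counter

# Sizes stated in the description.
N_CORPORA = 29
TRAIN_PER_CORPUS = 1000
DEV_PER_CORPUS = 200
AUGMENTED_PUBLIC_LANDS = 779

# Totals implied by those sizes.
expected_train = N_CORPORA * TRAIN_PER_CORPUS + AUGMENTED_PUBLIC_LANDS
expected_dev = N_CORPORA * DEV_PER_CORPUS
expected_total = expected_train + expected_dev


def count_by_field(lines, field):
    """Count JSONL records by the value of one field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)[field]] += 1
    return counts


# Made-up sample records; the field names are hypothetical (see the README).
sample = io.StringIO(
    '{"text": "first speech", "split": "train", "label": "Health"}\n'
    '{"text": "second speech", "split": "train", "label": "Energy"}\n'
    '{"text": "third speech", "split": "dev", "label": "Health"}\n'
)
split_counts = count_by_field(sample, "split")
```

With the real data, the same function could be given an open file handle for ParlaCAP-train.jsonl (opened with UTF-8 encoding) and the actual name of the split field; the resulting counts should then match expected_train and expected_dev.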
dc.language.iso eus
dc.language.iso bos
dc.language.iso bul
dc.language.iso cat
dc.language.iso hrv
dc.language.iso ces
dc.language.iso dan
dc.language.iso nld
dc.language.iso eng
dc.language.iso est
dc.language.iso fin
dc.language.iso fra
dc.language.iso glg
dc.language.iso deu
dc.language.iso hun
dc.language.iso isl
dc.language.iso ita
dc.language.iso lav
dc.language.iso ell
dc.language.iso nor
dc.language.iso pol
dc.language.iso por
dc.language.iso rus
dc.language.iso srp
dc.language.iso slv
dc.language.iso spa
dc.language.iso swe
dc.language.iso tur
dc.language.iso ukr
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.48550/arXiv.2602.16516
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://clarinsi.github.io/parlacap/
dc.subject parliamentary debates
dc.subject UK Parliament
dc.subject Portuguese Parliament
dc.subject Basque Parliament
dc.subject Estonian Parliament
dc.subject Spanish Parliament
dc.subject Finnish Parliament
dc.subject Ukrainian Parliament
dc.subject Swedish Parliament
dc.subject Serbian Parliament
dc.subject Norwegian Parliament
dc.subject Greek Parliament
dc.subject Galician Parliament
dc.subject Catalonian Parliament
dc.subject Bosnian Parliament
dc.subject Austrian Parliament
dc.subject French Parliament
dc.subject Slovenian Parliament
dc.subject Polish Parliament
dc.subject Croatian Parliament
dc.subject Bulgarian Parliament
dc.subject Latvian Parliament
dc.subject Hungarian Parliament
dc.subject Italian Parliament
dc.subject Turkish Parliament
dc.subject Dutch Parliament
dc.subject Danish Parliament
dc.subject Belgian Parliament
dc.subject Icelandic Parliament
dc.subject Czech Parliament
dc.subject topic classification
dc.subject topic
dc.subject Comparative Agendas Project
dc.subject CAP
dc.subject CAP topic
dc.subject policy topic
dc.subject agenda-setting
dc.title Multilingual training dataset for CAP policy topic classification ParlaCAP-train
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor OSCARS project ParlaCap Comparing agenda settings across parliaments via the ParlaMint dataset Other
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Institute of Contemporary History DARIAH DARIAH-SI nationalFunds
size.info 35579 utterances
size.info 5341147 words
files.count 1
files.size 20465646


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name ParlaCAP-train.zip
Size 19.52 MB
Format application/zip
Description The ParlaCAP-train.jsonl dataset and the accompanying README.md file
MD5 c3000cda6677615d1056e31e31c8a05a
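
The MD5 value listed above can be used to check that a download is intact. A minimal Python sketch using only the standard library follows; the helper names `md5_of_file` and `matches_expected` are made up for this example, and the local path of the downloaded archive is an assumption.

```python
import hashlib

# MD5 checksum listed for ParlaCAP-train.zip in this record.
EXPECTED_MD5 = "c3000cda6677615d1056e31e31c8a05a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("ParlaCAP-train.zip")` should return True for an uncorrupted copy saved under that name in the working directory.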