Spoken corpora of parliamentary debates ParlaSpeech 3.0

License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Rupnik, Peter; Porupski, Ivan; Kuzman Pungeršek, Taja; Koržinek, Danijel; Kopp, Matyáš

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Porupski, Ivan
dc.contributor.author	Kuzman Pungeršek, Taja
dc.contributor.author	Koržinek, Danijel
dc.contributor.author	Kopp, Matyáš
dc.date.accessioned	2025-08-12T12:05:38Z
dc.date.available	2025-08-12T12:05:38Z
dc.date.issued	2025-06-12
dc.identifier.uri	http://hdl.handle.net/11356/1833
dc.description	The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.

This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785).

The main extension in this version is a set of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: linguistic annotations in Universal Dependencies (UD) format (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

Data size per parliament:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Intended mainly for automated processing.
* VERT: vertical format intended for concordancers, with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.

For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.
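As an illustration of the JSONL master format described above, the following Python sketch streams a gzipped file line by line without first decompressing it to disk. The function name `iter_jsonl_gz` is hypothetical, and the sketch assumes each line holds one JSON object. No field names are assumed; the actual record schema is documented at https://clarinsi.github.io/parlaspeech/.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # "rt" opens the gzip stream in text mode, so lines are decoded on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)
```

For instance, `next(iter_jsonl_gz("ParlaSpeech-RS.v3.0.jsonl.gz"))` would return the first record, and listing its keys is a quick way to see which fields are present before writing any schema-specific code.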
dc.language.iso	hrv
dc.language.iso	srp
dc.language.iso	pol
dc.language.iso	ces
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.1007/978-3-031-77961-9_10
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://clarinsi.github.io/parlaspeech/
dc.subject	parliamentary debates
dc.subject	speech transcription
dc.subject	speech database
dc.subject	filled pauses
dc.subject	primary stress
dc.subject	sentiment classification
dc.title	Spoken corpora of parliamentary debates ParlaSpeech 3.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person	Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor	ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Institute of Contemporary History DARIAH DARIAH-SI nationalFunds
size.info	6234 hours
size.info	48409635 words
size.info	2466604 utterances
files.count	10
files.size	10989622112
featuredService.noske	search Croatian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_hr
featuredService.noske	search Czech corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_cz
featuredService.noske	search Polish corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_pl
featuredService.noske	search Serbian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_rs

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: ParlaSpeech-HR.v3.0.jsonl.gz
Size: 2.58 GB
Format: application/gzip
Description: ParlaSpeech-HR in JSONL format
MD5: c4aa1cf98e9d2d49689a6b03a46bffe6

Name: ParlaSpeech-HR.v3.0.textgrid.tgz
Size: 4.33 GB
Format: Unknown
Description: ParlaSpeech-HR TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5: 5111633a8153c946ad1f59edfa9cf482

Name: ParlaSpeech-HR.v3.0.vert.gz
Size: 259.22 MB
Format: application/gzip
Description: ParlaSpeech-HR in vertical format
MD5: 3909150e66ef7379f374f8f70f978e86

Name: ParlaSpeech-RS.v3.0.jsonl.gz
Size: 585.96 MB
Format: application/gzip
Description: ParlaSpeech-RS in JSONL format
MD5: 03c6c32f763d77ae97385b75db6abe6a

Name: ParlaSpeech-RS.v3.0.textgrid.tgz
Size: 1.27 GB
Format: Unknown
Description: ParlaSpeech-RS TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5: 01a140f28e351f4709add3a74fb8432e

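The MD5 values listed with each file can be used to check a download for corruption. The following Python sketch shows one way to do this; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the file is read in chunks because the archives are several gigabytes in size.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives never sit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed_md5: str) -> bool:
    """Compare a file's digest with a checksum copied from the file listing."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For an intact download, a call such as `matches_listed_md5("ParlaSpeech-HR.v3.0.vert.gz", "3909150e66ef7379f374f8f70f978e86")` is expected to return `True`; a `False` result suggests the file should be downloaded again.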