dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Porupski, Ivan |
dc.contributor.author | Kuzman Pungeršek, Taja |
dc.contributor.author | Koržinek, Danijel |
dc.contributor.author | Kopp, Matyáš |
dc.date.accessioned | 2025-08-12T12:05:38Z |
dc.date.available | 2025-08-12T12:05:38Z |
dc.date.issued | 2025-06-12 |
dc.identifier.uri | http://hdl.handle.net/11356/1833 |
dc.description | The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.
This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785).
The main extension in this version consists of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: linguistic annotations in Universal Dependencies (UD) format (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript
Data size per parliament:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses
The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Intended mainly for machine processing.
* VERT: vertical format intended for concordancers, with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.
For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/. |
dc.language.iso | hrv |
dc.language.iso | srp |
dc.language.iso | pol |
dc.language.iso | ces |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://doi.org/10.1007/978-3-031-77961-9_10 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://clarinsi.github.io/parlaspeech/ |
dc.subject | parliamentary debates |
dc.subject | speech transcription |
dc.subject | speech database |
dc.subject | filled pauses |
dc.subject | primary stress |
dc.subject | sentiment classification |
dc.title | Spoken corpora of parliamentary debates ParlaSpeech 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
contact.person | Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Institute of Contemporary History DARIAH DARIAH-SI nationalFunds |
size.info | 6234 hours |
size.info | 48409635 words |
size.info | 2466604 utterances |
files.count | 10 |
files.size | 10904145977 |
featuredService.noske | search Croatian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_hr |
featuredService.noske | search Czech corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_cz |
featuredService.noske | search Polish corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_pl |
featuredService.noske | search Serbian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_rs |
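Since the description calls JSONL the master format, with one valid JSON serialization per line, the following is a minimal sketch of how a `.jsonl.gz` file could be streamed in Python without unpacking it to disk. The function name `iter_jsonl_gz` is hypothetical, and the sketch assumes only that the file is UTF-8 encoded and that each line holds a JSON object; it makes no assumption about field names, for which the schema on the dedicated website should be consulted.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzip-compressed JSONL file.

    Hypothetical helper: assumes UTF-8 text and one JSON object per line.
    """
    # Mode "rt" decompresses and decodes on the fly, so the archive is
    # read line by line rather than loaded into memory as a whole.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

A typical use would be to inspect the keys of the first record, for example `next(iter_jsonl_gz("ParlaSpeech-RS.v3.0.jsonl.gz")).keys()`, before deciding which fields to process.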
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | Size | Format | Description | MD5 |
ParlaSpeech-HR.v3.0.jsonl.gz | 2.58 GB | application/gzip | ParlaSpeech-HR in JSONL format | c4aa1cf98e9d2d49689a6b03a46bffe6 |
ParlaSpeech-HR.v3.0.textgrid.tgz | 4.33 GB | Unknown | ParlaSpeech-HR TextGrids (word+grapheme alignment, primary word stress, filled pauses) | 5111633a8153c946ad1f59edfa9cf482 |
ParlaSpeech-HR.v3.0.vert.gz | 259.22 MB | application/gzip | ParlaSpeech-HR in vertical format | 3909150e66ef7379f374f8f70f978e86 |
ParlaSpeech-RS.v3.0.jsonl.gz | 585.96 MB | application/gzip | ParlaSpeech-RS in JSONL format | 03c6c32f763d77ae97385b75db6abe6a |
ParlaSpeech-RS.v3.0.textgrid.tgz | 1.27 GB | Unknown | ParlaSpeech-RS TextGrids (word+grapheme alignment, primary word stress, filled pauses) | 01a140f28e351f4709add3a74fb8432e |
ParlaSpeech-RS.v3.0.vert.gz | 72.9 MB | application/gzip | ParlaSpeech-RS in vertical format | eeec9bfeae72d3c90b53a06431f4816a |
ParlaSpeech-CZ.v3.0.jsonl.gz | 518.98 MB | application/gzip | ParlaSpeech-CZ in JSONL format | e93410afd47f147b31c0025cd5d3a9f6 |
ParlaSpeech-CZ.v3.0.vert.gz | 128.89 MB | application/gzip | ParlaSpeech-CZ in vertical format | 39d45244706f0f4813649fcdc55cb899 |
ParlaSpeech-PL.v3.0.jsonl.gz | 357.65 MB | application/gzip | ParlaSpeech-PL in JSONL format | caafd1d47f9b2c8dc257417091a705d8 |
ParlaSpeech-PL.v3.0.vert.gz | 103.1 MB | application/gzip | ParlaSpeech-PL in vertical format | 9da192e953e10aaae24f5a35c6b87176 |
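
Each file in the listing comes with an MD5 checksum, so a download can be checked for corruption before use. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the file is read in chunks because several of the archives are multiple gigabytes in size.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, expected_md5: str) -> bool:
    """Compare a file's digest with a checksum copied from the file listing."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_listed_md5("ParlaSpeech-PL.v3.0.vert.gz", "9da192e953e10aaae24f5a35c6b87176")` should return `True` for an intact download of that file, assuming it was saved under its listed name in the current directory.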