dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.contributor.author Porupski, Ivan
dc.contributor.author Kuzman Pungeršek, Taja
dc.contributor.author Koržinek, Danijel
dc.contributor.author Kopp, Matyáš
dc.date.accessioned 2025-08-12T12:05:38Z
dc.date.available 2025-08-12T12:05:38Z
dc.date.issued 2025-06-12
dc.identifier.uri http://hdl.handle.net/11356/1833
dc.description The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.

This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785).

The main extension in this version consists of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: Universal Dependencies (UD) formatted linguistic annotations (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

Data size per parliament is the following:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Mostly intended for computerized processing.
* VERT: vertical format intended for concordancers, with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) in research and applications in phonetics and other speech-focused disciplines.

For a detailed dataset schema description and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.
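Because the JSONL master format is described as gzip-compressed, newline-delimited JSON, the files can be streamed one line at a time rather than loaded whole. The following is a minimal sketch under stated assumptions, not the maintainers' tooling: it assumes that each line decodes to a JSON object, and the helper names `iter_jsonl_gz` and `peek_keys` are hypothetical. No field names are assumed; consult the dataset schema on the dedicated website for the actual structure.

```python
import gzip
import itertools
import json
from typing import Iterator, List


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a gzipped JSONL file.

    Assumption: every line holds one JSON object (one corpus instance).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def peek_keys(path: str, limit: int = 1) -> List[List[str]]:
    """Return the sorted top-level keys of the first `limit` records.

    Useful for inspecting the schema without reading a multi-gigabyte file.
    """
    return [sorted(record) for record in itertools.islice(iter_jsonl_gz(path), limit)]
```

For instance, `peek_keys("ParlaSpeech-RS.v3.0.jsonl.gz")` would list the top-level keys of the first record, provided that file has been downloaded to the working directory.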
dc.language.iso hrv
dc.language.iso srp
dc.language.iso pol
dc.language.iso ces
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.1007/978-3-031-77961-9_10
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://clarinsi.github.io/parlaspeech/
dc.subject parliamentary debates
dc.subject speech transcription
dc.subject speech database
dc.subject filled pauses
dc.subject primary stress
dc.subject sentiment classification
dc.title Spoken corpora of parliamentary debates ParlaSpeech 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
demo.uri https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Institute of Contemporary History DARIAH DARIAH-SI nationalFunds
size.info 6234 hours
size.info 48409635 words
size.info 2466604 utterances
files.count 10
files.size 10904145977
featuredService.noske search Croatian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_hr
featuredService.noske search Czech corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_cz
featuredService.noske search Polish corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_pl
featuredService.noske search Serbian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_rs


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name ParlaSpeech-HR.v3.0.jsonl.gz
Size 2.58 GB
Format application/gzip
Description ParlaSpeech-HR in JSONL format
MD5 c4aa1cf98e9d2d49689a6b03a46bffe6

Name ParlaSpeech-HR.v3.0.textgrid.tgz
Size 4.33 GB
Format Unknown
Description ParlaSpeech-HR TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5 5111633a8153c946ad1f59edfa9cf482

Name ParlaSpeech-HR.v3.0.vert.gz
Size 259.22 MB
Format application/gzip
Description ParlaSpeech-HR in vertical format
MD5 3909150e66ef7379f374f8f70f978e86

Name ParlaSpeech-RS.v3.0.jsonl.gz
Size 585.96 MB
Format application/gzip
Description ParlaSpeech-RS in JSONL format
MD5 03c6c32f763d77ae97385b75db6abe6a

Name ParlaSpeech-RS.v3.0.textgrid.tgz
Size 1.27 GB
Format Unknown
Description ParlaSpeech-RS TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5 01a140f28e351f4709add3a74fb8432e

Name ParlaSpeech-RS.v3.0.vert.gz
Size 72.9 MB
Format application/gzip
Description ParlaSpeech-RS in vertical format
MD5 eeec9bfeae72d3c90b53a06431f4816a

Name ParlaSpeech-CZ.v3.0.jsonl.gz
Size 518.98 MB
Format application/gzip
Description ParlaSpeech-CZ in JSONL format
MD5 e93410afd47f147b31c0025cd5d3a9f6

Name ParlaSpeech-CZ.v3.0.vert.gz
Size 128.89 MB
Format application/gzip
Description ParlaSpeech-CZ in vertical format
MD5 39d45244706f0f4813649fcdc55cb899

Name ParlaSpeech-PL.v3.0.jsonl.gz
Size 357.65 MB
Format application/gzip
Description ParlaSpeech-PL in JSONL format
MD5 caafd1d47f9b2c8dc257417091a705d8

Name ParlaSpeech-PL.v3.0.vert.gz
Size 103.1 MB
Format application/gzip
Description ParlaSpeech-PL in vertical format
MD5 9da192e953e10aaae24f5a35c6b87176
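
The MD5 values listed with each file can be used to check that a download completed intact. The sketch below is one way to do this with the Python standard library; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the two checksums in `EXPECTED_MD5` are copied from the listing above as examples. Files are read in chunks so that the multi-gigabyte archives do not need to fit in memory.

```python
import hashlib

# Example checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "ParlaSpeech-RS.v3.0.vert.gz": "eeec9bfeae72d3c90b53a06431f4816a",
    "ParlaSpeech-PL.v3.0.vert.gz": "9da192e953e10aaae24f5a35c6b87176",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `matches_listing("ParlaSpeech-RS.v3.0.vert.gz", EXPECTED_MD5["ParlaSpeech-RS.v3.0.vert.gz"])` returns `True` only when the local copy has the listed digest. MD5 is suitable here as a transfer-integrity check, not as a security guarantee.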