
 
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Rupnik, Peter
dc.contributor.author Porupski, Ivan
dc.contributor.author Kuzman Pungeršek, Taja
dc.contributor.author Koržinek, Danijel
dc.contributor.author Kopp, Matyáš
dc.date.accessioned 2025-08-12T12:05:38Z
dc.date.available 2025-08-12T12:05:38Z
dc.date.issued 2025-06-12
dc.identifier.uri http://hdl.handle.net/11356/1833
dc.description The ParlaSpeech corpora are built from the transcripts of parliamentary proceedings of the Croatian, Serbian, Polish, and Czech parliaments available in the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859), and from the parliamentary recordings available on the parliaments' YouTube channels. An instance is a transcript sentence with the corresponding metadata and the aligned audio.

This version of the ParlaSpeech corpora does not release the audio files, as it covers the same data as the preceding versions, i.e. version 2.0 for HR (http://hdl.handle.net/11356/1914) and version 1.0 for RS (http://hdl.handle.net/11356/1834), PL (http://hdl.handle.net/11356/1686), and CZ (http://hdl.handle.net/11356/1785).

The main extension in this version is a set of five enrichment layers:
* ParlaSpeech-Pause: automatic annotations of filled pauses ("eerm")
* ParlaSpeech-Align: precise word- and grapheme-level alignment (HR, RS only)
* ParlaSpeech-Stress: labelled primary stress in multisyllabic words (HR, RS only)
* ParlaSpeech-Ling: Universal Dependencies (UD) formatted linguistic annotations (lemma, part-of-speech, syntax, etc.)
* ParlaSpeech-Senti: sentiment estimation based on the transcript

Data size per parliament:
* Croatia (HR): 923k sentences, 3k hours, 324k filled pauses, 11M word stresses
* Serbia (RS): 291k sentences, 900 hours, 74k filled pauses, 2M word stresses
* Czechia (CZ): 718k sentences, 1.2k hours, 200k filled pauses, no word stresses
* Poland (PL): 535k sentences, 1k hours, 200k filled pauses, no word stresses

The data are available in the following formats:
* JSONL: master format, containing all the data. Distributed as newline-delimited JSON, where each line is a valid JSON serialization. Intended mainly for automated processing.
* VERT: vertical format intended for concordancers, with text, links to audio, linguistic annotations, sentiment, filled pauses, and primary word stress (where available).
* TextGrid (HR and RS only): word- and grapheme-level alignment, primary word stress, and filled pauses. This format is intended for use with the Praat software (https://www.fon.hum.uva.nl/praat/) for research and applications in phonetics and other speech-focused disciplines.

For a detailed description of the dataset schema and examples, please see our dedicated website: https://clarinsi.github.io/parlaspeech/.
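Since the JSONL files are described as newline-delimited JSON with one serialization per line, they can be streamed without unpacking the archive. The following is a minimal sketch, not part of the official tooling. It assumes the files are UTF-8 encoded and that each line holds a JSON object; the helper names `iter_jsonl` and `peek_keys` are hypothetical. No field names are assumed here; the actual schema is documented on the dataset website.

```python
import gzip
import json
from itertools import islice


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a gzipped JSONL file."""
    # Assumption: the file is UTF-8 encoded text inside a gzip container.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def peek_keys(path, n=3):
    """Return the sorted top-level keys of the first n records.

    Assumption: every record is a JSON object (a dict after parsing).
    """
    return [sorted(record) for record in islice(iter_jsonl(path), n)]
```

For example, `peek_keys("ParlaSpeech-RS.v3.0.jsonl.gz")` would list the top-level keys of the first three instances, which is a quick way to compare a download against the published schema. Streaming line by line keeps memory use flat even for the multi-gigabyte HR file.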
dc.language.iso hrv
dc.language.iso srp
dc.language.iso pol
dc.language.iso ces
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://doi.org/10.1007/978-3-031-77961-9_10
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://clarinsi.github.io/parlaspeech/
dc.subject parliamentary debates
dc.subject speech transcription
dc.subject speech database
dc.subject filled pauses
dc.subject primary stress
dc.subject sentiment classification
dc.title Spoken corpora of parliamentary debates ParlaSpeech 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
demo.uri https://clarinsi.github.io/parlaspeech/concordancer/concordancer-guide.html
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
contact.person Peter Rupnik peter.rupnik@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Institute of Contemporary History DARIAH DARIAH-SI nationalFunds
size.info 6234 hours
size.info 48409635 words
size.info 2466604 utterances
files.count 10
files.size 10904145977
featuredService.noske search Croatian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_hr
featuredService.noske search Czech corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_cz
featuredService.noske search Polish corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_pl
featuredService.noske search Serbian corpus|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech3_rs


 Files in this item

Name: ParlaSpeech-HR.v3.0.jsonl.gz
Size: 2.58 GB
Format: application/gzip
Description: ParlaSpeech-HR in JSONL format
MD5: c4aa1cf98e9d2d49689a6b03a46bffe6

Name: ParlaSpeech-HR.v3.0.textgrid.tgz
Size: 4.33 GB
Format: Unknown
Description: ParlaSpeech-HR TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5: 5111633a8153c946ad1f59edfa9cf482

Name: ParlaSpeech-HR.v3.0.vert.gz
Size: 259.22 MB
Format: application/gzip
Description: ParlaSpeech-HR in vertical format
MD5: 3909150e66ef7379f374f8f70f978e86

Name: ParlaSpeech-RS.v3.0.jsonl.gz
Size: 585.96 MB
Format: application/gzip
Description: ParlaSpeech-RS in JSONL format
MD5: 03c6c32f763d77ae97385b75db6abe6a

Name: ParlaSpeech-RS.v3.0.textgrid.tgz
Size: 1.27 GB
Format: Unknown
Description: ParlaSpeech-RS TextGrids (word+grapheme alignment, primary word stress, filled pauses)
MD5: 01a140f28e351f4709add3a74fb8432e

Name: ParlaSpeech-RS.v3.0.vert.gz
Size: 72.9 MB
Format: application/gzip
Description: ParlaSpeech-RS in vertical format
MD5: eeec9bfeae72d3c90b53a06431f4816a

Name: ParlaSpeech-CZ.v3.0.jsonl.gz
Size: 518.98 MB
Format: application/gzip
Description: ParlaSpeech-CZ in JSONL format
MD5: e93410afd47f147b31c0025cd5d3a9f6

Name: ParlaSpeech-CZ.v3.0.vert.gz
Size: 128.89 MB
Format: application/gzip
Description: ParlaSpeech-CZ in vertical format
MD5: 39d45244706f0f4813649fcdc55cb899

Name: ParlaSpeech-PL.v3.0.jsonl.gz
Size: 357.65 MB
Format: application/gzip
Description: ParlaSpeech-PL in JSONL format
MD5: caafd1d47f9b2c8dc257417091a705d8

Name: ParlaSpeech-PL.v3.0.vert.gz
Size: 103.1 MB
Format: application/gzip
Description: ParlaSpeech-PL in vertical format
MD5: 9da192e953e10aaae24f5a35c6b87176
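
The MD5 values listed for each file can be used to confirm that a download completed without corruption. The sketch below is one way to do this with the Python standard library; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the chunk size is an arbitrary choice. MD5 is used here only because it is the checksum this record publishes; it detects accidental corruption, not deliberate tampering.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read until handle.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_md5):
    """Compare a file's MD5 digest with the value listed in the record."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_listed_md5("ParlaSpeech-PL.v3.0.vert.gz", "9da192e953e10aaae24f5a35c6b87176")` should return `True` for an intact copy of that file. Reading in chunks avoids loading the multi-gigabyte archives into memory.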
