Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0

Name: Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Kopp, Matyáš; Ljubešić, Nikola

dc.contributor.author	Kopp, Matyáš
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2024-08-09T09:01:00Z
dc.date.available	2024-08-09T09:01:00Z
dc.date.issued	2024-07-24
dc.identifier.uri	http://hdl.handle.net/11356/1785
dc.description	The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings in the Czech part of the ParlaMint corpus and the parliamentary recordings from the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key. Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and each word description has an additional "id" key referring to the ParlaMint word ID. This is because the original ParlaMint sentence and word segmentation was kept in this dataset, owing to a different, centralised processing approach. An additional "audio_source" key points to the original audio recording in the AudioPSP dataset.
dc.language.iso	ces
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/2022.parlaclarin-1.16
dc.relation.isreferencedby	https://link.springer.com/chapter/10.1007/978-3-030-83527-9_25
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.eu/parlamint
dc.subject	parliamentary debates
dc.subject	speech recordings
dc.subject	speech database
dc.subject	speech recognition
dc.subject	automatic speech recognition
dc.subject	speech transcription
dc.title	Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/datasets/classla/ParlaSpeech-CZ
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency); P6-0411; Language Resources and Technologies for Slovene; nationalFunds
sponsor	CLARIN ERIC; -; ParlaMint: Towards Comparable Parliamentary Corpora; Other
sponsor	ARRS (Slovenian Research Agency); J7-4642; MEZZANINE; nationalFunds
sponsor	Jožef Stefan Institute; CLARIN; CLARIN.SI; nationalFunds
sponsor	Ministry of Education, Youth and Sports of the Czech Republic; LM2023062; LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities; nationalFunds
size.info	717682 units
size.info	4385505 seconds
size.info	1218 hours
files.count	5
files.size	164100257591