dc.contributor.author | Kopp, Matyáš |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2024-08-09T09:01:00Z |
dc.date.available | 2024-08-09T09:01:00Z |
dc.date.issued | 2024-07-24 |
dc.identifier.uri | http://hdl.handle.net/11356/1785 |
dc.description | The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus, and the parliamentary recordings available from the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, so long sentences can easily be split further into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key. Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and an additional "id" key in the description of each word referring to the ParlaMint word ID. This is because the original ParlaMint sentence and word segmentation was kept in this dataset, owing to a different, centralised processing approach. An "audio_source" key is also available, pointing to the original audio recording from the AudioPSP dataset. |
dc.language.iso | ces |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://aclanthology.org/2022.parlaclarin-1.16 |
dc.relation.isreferencedby | https://link.springer.com/chapter/10.1007/978-3-030-83527-9_25 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.eu/parlamint |
dc.subject | parliamentary debates |
dc.subject | speech recordings |
dc.subject | speech database |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.subject | speech transcription |
dc.title | Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/datasets/classla/ParlaSpeech-CZ |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | Ministry of Education, Youth and Sports of the Czech Republic LM2023062 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds |
size.info | 717682 units |
size.info | 4385505 seconds |
size.info | 1218 hours |
files.count | 5 |
files.size | 164100257591 |
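
The description names several per-instance keys ("speaker_info", "sentence_id", "audio_source"), and the corpus text is distributed as gzipped JSON Lines. A minimal sketch of reading such a file and looking at those keys could look as follows. It assumes each line holds one JSON object with the named keys at the top level; the function names `iter_records` and `summarize` are illustrative and not part of the dataset, and the remaining structure of a record (for example, how the word list is named) is not stated in this record, so the sketch does not touch it.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the keys named in the description (assumed to be top-level keys).

    Missing keys yield None instead of raising, since the exact schema
    is an assumption here.
    """
    return {
        "sentence_id": record.get("sentence_id"),
        "audio_source": record.get("audio_source"),
        "has_speaker_info": "speaker_info" in record,
    }
```

Reading line by line keeps memory use low, which matters for a file of this size; for example, `for rec in iter_records("ParlaSpeech-CZ.v1.0.jsonl.gz"): print(summarize(rec))` would print one summary per instance.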
Files in this item
Name | Size | Format | Description | MD5 |
ParlaSpeech-CZ.v1.0.jsonl.gz | 199.84 MB | application/gzip | Corpus text in gzipped JSON Lines format | 61143e9e21e24cc09f773742ce47d4f6 |
ParlaSpeech-CZ.v1.0.part1.tgz | 46.33 GB | Unknown | Speech in FLAC format, part 1 | e6fbbae9d0327f08d9b832b4822c9976 |
ParlaSpeech-CZ.v1.0.part2.tgz | 40.61 GB | Unknown | Speech in FLAC format, part 2 | 91b03c9b50b52c04c8ec9529ce83d33b |
ParlaSpeech-CZ.v1.0.part3.tgz | 43.62 GB | Unknown | Speech in FLAC format, part 3 | 30a057149006c86575d889409de88631 |
ParlaSpeech-CZ.v1.0.part4.tgz | 22.07 GB | Unknown | Speech in FLAC format, part 4 | 58a37e7fcb1309bc9c20b9b46155036a |
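
Given the size of the archives, it may be worth checking each download against the MD5 values listed for the files. The following is a minimal sketch using Python's standard `hashlib`; the file names and checksums are copied from the listing, while the helper names `md5_of_file` and `verify` and the assumption that the files sit in the current directory are illustrative only.

```python
import hashlib

# File names and MD5 checksums as given in the file listing.
EXPECTED_MD5 = {
    "ParlaSpeech-CZ.v1.0.jsonl.gz": "61143e9e21e24cc09f773742ce47d4f6",
    "ParlaSpeech-CZ.v1.0.part1.tgz": "e6fbbae9d0327f08d9b832b4822c9976",
    "ParlaSpeech-CZ.v1.0.part2.tgz": "91b03c9b50b52c04c8ec9529ce83d33b",
    "ParlaSpeech-CZ.v1.0.part3.tgz": "30a057149006c86575d889409de88631",
    "ParlaSpeech-CZ.v1.0.part4.tgz": "58a37e7fcb1309bc9c20b9b46155036a",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with the expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

A possible use is `for name, md5 in EXPECTED_MD5.items(): print(name, verify(name, md5))`, run from the directory holding the downloads. Chunked reading is chosen because the audio archives are tens of gigabytes each and should not be loaded into memory at once.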