
 
dc.contributor.author Kopp, Matyáš
dc.contributor.author Ljubešić, Nikola
dc.date.accessioned 2024-08-09T09:01:00Z
dc.date.available 2024-08-09T09:01:00Z
dc.date.issued 2024-07-24
dc.identifier.uri http://hdl.handle.net/11356/1785
dc.description The ParlaSpeech-CZ dataset is built from the transcripts of parliamentary proceedings available in the Czech part of the ParlaMint corpus and the parliamentary recordings available from the AudioPSP dataset (http://hdl.handle.net/11234/1-5404). The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences further into shorter segments for ASR and other memory-sensitive applications. Each segment refers to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key. Unlike other ParlaSpeech datasets, each instance in this dataset has an additional "sentence_id" key referring to the ParlaMint sentence ID, and the description of each word has an additional "id" key referring to the ParlaMint word ID. This is because the original ParlaMint sentence and word segmentation was kept in this dataset, owing to a different, centralised processing approach. The "audio_source" key is also available, pointing to the original audio recording in the AudioPSP dataset.
dc.language.iso ces
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/2022.parlaclarin-1.16
dc.relation.isreferencedby https://link.springer.com/chapter/10.1007/978-3-030-83527-9_25
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.eu/parlamint
dc.subject parliamentary debates
dc.subject speech recordings
dc.subject speech database
dc.subject speech recognition
dc.subject automatic speech recognition
dc.subject speech transcription
dc.title Parliamentary spoken corpus of Czech ParlaSpeech-CZ 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/datasets/classla/ParlaSpeech-CZ
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2023062 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds
size.info 717682 units
size.info 4385505 seconds
size.info 1218 hours
files.count 5
files.size 164100257591
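
The record layout given in dc.description can be explored with a short script. The sketch below is only an illustration, not an official loader for the dataset. The keys "sentence_id", "speaker_info", "audio_source" and the per-word "id" come from the description. The word-level time keys "time_s" and "time_e" (start and end in seconds) are assumptions and should be checked against the actual file. The helper names read_records and split_by_duration are hypothetical.

```python
import gzip
import json
from typing import Iterator


def read_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def split_by_duration(words: list[dict], max_seconds: float,
                      start_key: str = "time_s",
                      end_key: str = "time_e") -> list[list[dict]]:
    """Greedily group consecutive words into chunks of at most max_seconds.

    start_key and end_key are ASSUMED names for the word-level alignment
    times; adjust them to the keys actually present in the data.
    A single word longer than max_seconds still forms its own chunk.
    """
    chunks: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word[end_key] - current[0][start_key] > max_seconds:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks
```

With the corpus text file downloaded, `next(read_records("ParlaSpeech-CZ.v1.0.jsonl.gz"))` would return the first segment as a dict, from which "sentence_id" and "audio_source" can be read directly. The greedy split keeps word boundaries intact, which is the property the word-level alignments are meant to provide.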


Files in this item

Name ParlaSpeech-CZ.v1.0.jsonl.gz
Size 199.84 MB
Format application/gzip
Description Corpus text in gzipped JSON Lines format
MD5 61143e9e21e24cc09f773742ce47d4f6

Name ParlaSpeech-CZ.v1.0.part1.tgz
Size 46.33 GB
Format Unknown
Description Speech in FLAC format, part 1
MD5 e6fbbae9d0327f08d9b832b4822c9976

Name ParlaSpeech-CZ.v1.0.part2.tgz
Size 40.61 GB
Format Unknown
Description Speech in FLAC format, part 2
MD5 91b03c9b50b52c04c8ec9529ce83d33b

Name ParlaSpeech-CZ.v1.0.part3.tgz
Size 43.62 GB
Format Unknown
Description Speech in FLAC format, part 3
MD5 30a057149006c86575d889409de88631

Name ParlaSpeech-CZ.v1.0.part4.tgz
Size 22.07 GB
Format Unknown
Description Speech in FLAC format, part 4
MD5 58a37e7fcb1309bc9c20b9b46155036a
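
Because the archives are tens of gigabytes each, it may be worth checking a download against the MD5 values listed for each file. The following is a minimal sketch using Python's standard hashlib module. The function names md5_of_file and verify are hypothetical, and the file is read in blocks so that a large archive is never loaded into memory at once.

```python
import hashlib

# MD5 checksums copied from the file list in this record.
EXPECTED_MD5 = {
    "ParlaSpeech-CZ.v1.0.jsonl.gz": "61143e9e21e24cc09f773742ce47d4f6",
    "ParlaSpeech-CZ.v1.0.part1.tgz": "e6fbbae9d0327f08d9b832b4822c9976",
    "ParlaSpeech-CZ.v1.0.part2.tgz": "91b03c9b50b52c04c8ec9529ce83d33b",
    "ParlaSpeech-CZ.v1.0.part3.tgz": "30a057149006c86575d889409de88631",
    "ParlaSpeech-CZ.v1.0.part4.tgz": "58a37e7fcb1309bc9c20b9b46155036a",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, name: str) -> bool:
    """Compare the digest of a local file with the listed value for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For instance, `verify("ParlaSpeech-CZ.v1.0.part1.tgz", "ParlaSpeech-CZ.v1.0.part1.tgz")` would return True only if the local copy matches the checksum published in this record.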
