dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Koržinek, Danijel |
dc.contributor.author | Rupnik, Peter |
dc.date.accessioned | 2024-01-28T11:50:25Z |
dc.date.available | 2024-01-28T11:50:25Z |
dc.date.issued | 2024-01-25 |
dc.identifier.uri | http://hdl.handle.net/11356/1914 |
dc.description | The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus and from the parliamentary recordings available on the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key. The main differences from version 1.0 of the dataset are: (1) larger size (ParlaMint 4.0 is used here, while ParlaMint 2.1 was used previously); (2) an improved matching pipeline; (3) segments based on linguistically sound sentences from the ParlaMint transcripts, while segments surrounded by silence were used previously. |
dc.language.iso | hrv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://doi.org/10.1007/978-3-031-77961-9_10 |
dc.relation.replaces | http://hdl.handle.net/11356/1494 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora |
dc.subject | parliamentary debates |
dc.subject | speech recordings |
dc.subject | speech database |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.subject | speech transcription |
dc.title | Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/datasets/classla/ParlaSpeech-HR |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
size.info | 922679 entries |
size.info | 11019983 seconds |
size.info | 3061 hours |
files.count | 8 |
files.size | 222623336523 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=parlaspeech_hr |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_hr |
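The three size.info values can be cross-checked with simple arithmetic. The sketch below only reuses the figures from the rows above; the variable names are illustrative and not part of the record.

```python
# Figures copied from the size.info rows of this record.
ENTRIES = 922_679
TOTAL_SECONDS = 11_019_983

# Convert the total duration from seconds to hours.
hours = TOTAL_SECONDS / 3600

# Average length of one audio segment (one entry) in seconds.
mean_segment_s = TOTAL_SECONDS / ENTRIES

print(f"{hours:.1f} hours in total")
print(f"{mean_segment_s:.1f} s per segment on average")
```

The whole-hour part of the result matches the 3061 hours stated in the record, and the average segment comes out at roughly 12 seconds.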
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name | Size | Format | Description | MD5 |
ParlaSpeech-HR.v2.0.jsonl.gz | 362.17 MB | application/gzip | Corpus text in gzipped JSON Lines format | bfdad5b7a3fc1a5f42e2e00b6fdd999f |
ParlaSpeech-HR.v2.0.part1.tgz | 30.48 GB | Unknown | Speech in FLAC format, part 1 | 065b28dab675a9fa7b96e4aa2f37418b |
ParlaSpeech-HR.v2.0.part2.tgz | 42.37 GB | Unknown | Speech in FLAC format, part 2 | 53a37542cfe6e860eefee48caf180d66 |
ParlaSpeech-HR.v2.0.part3.tgz | 37.61 GB | Unknown | Speech in FLAC format, part 3 | e41cc3aa0d8b54c82b3250021ed4bf88 |
ParlaSpeech-HR.v2.0.part4.tgz | 41.48 GB | Unknown | Speech in FLAC format, part 4 | 5b618ca214c3f846f4d1d46386253c18 |
ParlaSpeech-HR.v2.0.part5.tgz | 50.13 GB | Unknown | Speech in FLAC format, part 5 | 9cbc3155cde96d8b9e0359745820febc |
ParlaSpeech-HR.v2.0.part6.tgz | 4.91 GB | Unknown | Speech in FLAC format, part 6 | ea859bacdbbb236c5b13f4bba6a4122f |
README.txt | 1023 bytes | Text file | Description of the corpus format | 7baa432c16d1480a961fd52ab5a95e97 |
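Given the size of the archives, it may be worth checking each download against the MD5 values listed for the files. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_downloads` are illustrative, and the sketch assumes the files were saved under their original names in a single directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "ParlaSpeech-HR.v2.0.jsonl.gz": "bfdad5b7a3fc1a5f42e2e00b6fdd999f",
    "ParlaSpeech-HR.v2.0.part1.tgz": "065b28dab675a9fa7b96e4aa2f37418b",
    "ParlaSpeech-HR.v2.0.part2.tgz": "53a37542cfe6e860eefee48caf180d66",
    "ParlaSpeech-HR.v2.0.part3.tgz": "e41cc3aa0d8b54c82b3250021ed4bf88",
    "ParlaSpeech-HR.v2.0.part4.tgz": "5b618ca214c3f846f4d1d46386253c18",
    "ParlaSpeech-HR.v2.0.part5.tgz": "9cbc3155cde96d8b9e0359745820febc",
    "ParlaSpeech-HR.v2.0.part6.tgz": "ea859bacdbbb236c5b13f4bba6a4122f",
    "README.txt": "7baa432c16d1480a961fd52ab5a95e97",
}


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch) or None (file missing)."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == checksum) if path.exists() else None
    return results
```

Calling `verify_downloads("downloads")`, where `downloads` is a hypothetical local directory, returns one entry per file, so incomplete or corrupted transfers can be spotted before unpacking.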
ASR training dataset for Croatian ParlaSpeech-HR v2.0
http://hdl.handle.net/11356/1914

The ParlaSpeech-HR.v2.0.jsonl (JSON Lines) file consists of entries with the following attributes:
- id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
- words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
- audio: path to the FLAC file (available from the part*.tgz files), with the folder name corresponding to the YouTube video ID
- audio_length: length of the recording in seconds
- text: transcript of the audio
- text_start: starting character position in the original ParlaMint 4.0 utterance
- text_end: ending character position in the original ParlaMint 4.0 utterance
- audio_start: starting millisecond position in the original YouTube video
- audio_end: ending millisecond position in the original YouTube video
- speaker_info: full information on the speaker (and speech) from th . . .
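The attributes above suggest a simple way to read the corpus and to split long entries at word boundaries. The sketch below is illustrative only. The README preview does not show the exact layout of the `words` list, so the per-word keys `char_start`, `char_end`, `time_start` and `time_end` (offsets relative to the entry, times in milliseconds) are assumptions; inspect one real entry and adapt the key names before use. The function names and the 10-second default are likewise hypothetical.

```python
import gzip
import json


def read_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def split_entry(entry, max_ms=10_000):
    """Greedily group consecutive words into chunks spanning at most max_ms.

    ASSUMPTION: every item of entry["words"] is a dict with the hypothetical
    keys char_start, char_end, time_start and time_end. A single word longer
    than max_ms still forms its own chunk.
    """
    groups, current = [], []
    for word in entry["words"]:
        # Close the current group if adding this word would exceed the limit.
        if current and word["time_end"] - current[0]["time_start"] > max_ms:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)

    text = entry["text"]
    return [
        {
            "text": text[group[0]["char_start"]:group[-1]["char_end"]],
            "start_ms": group[0]["time_start"],
            "end_ms": group[-1]["time_end"],
        }
        for group in groups
    ]
```

A typical loop would be `for entry in read_entries("ParlaSpeech-HR.v2.0.jsonl.gz"): chunks = split_entry(entry)`, with the resulting millisecond ranges then used to cut the matching FLAC file. Greedy grouping was chosen for brevity; splitting at pauses between words would be a natural refinement.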