Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0

Name: Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Koržinek, Danijel; Rupnik, Peter

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Koržinek, Danijel
dc.contributor.author	Rupnik, Peter
dc.date.accessioned	2024-01-28T11:50:25Z
dc.date.available	2024-01-28T11:50:25Z
dc.date.issued	2024-01-25
dc.identifier.uri	http://hdl.handle.net/11356/1914
dc.description	The ParlaSpeech-HR dataset is built from the transcripts of parliamentary proceedings available in the Croatian part of the ParlaMint corpus and from the parliamentary recordings available on the Croatian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which allows long sentences to be segmented further into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All speaker information from the ParlaMint corpus is available via the "speaker_info" key. The main differences from version 1.0 of the dataset are: (1) larger size (ParlaMint 4.0 is used here, whereas ParlaMint 2.1 was used previously); (2) an improved matching pipeline; (3) segments based on linguistically sound sentences from the ParlaMint transcripts, whereas previously segments delimited by silence were used.
dc.language.iso	hrv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.1007/978-3-031-77961-9_10
dc.relation.replaces	http://hdl.handle.net/11356/1494
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
dc.subject	parliamentary debates
dc.subject	speech recordings
dc.subject	speech database
dc.subject	speech recognition
dc.subject	automatic speech recognition
dc.subject	speech transcription
dc.title	Parliamentary spoken corpus of Croatian ParlaSpeech-HR 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/datasets/classla/ParlaSpeech-HR
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
size.info	922679 entries
size.info	11019983 seconds
size.info	3061 hours
files.count	8
files.size	222623336523
featuredService.kontext	search|https://www.clarin.si/kontext/query?corpname=parlaspeech_hr
featuredService.noske	search|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_hr
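
The size figures in this record (922679 entries, 11019983 seconds, 3061 hours) can be cross-checked against the corpus text file once it has been downloaded. The following is a minimal sketch, not part of the dataset's own tooling: it assumes every line of ParlaSpeech-HR.v2.0.jsonl.gz is one JSON object carrying the `audio_length` attribute (in seconds) described in the README, and the helper names `iter_entries` and `summarize` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def iter_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(entries):
    """Return (number of entries, total audio length in seconds)."""
    count = 0
    total_seconds = 0.0
    for entry in entries:
        count += 1
        # Assumption: audio_length is a number of seconds, as the README states.
        total_seconds += float(entry["audio_length"])
    return count, total_seconds


if __name__ == "__main__":
    # Assumption: the file was downloaded into the current directory.
    corpus = Path("ParlaSpeech-HR.v2.0.jsonl.gz")
    if corpus.exists():
        count, seconds = summarize(iter_entries(corpus))
        print(f"{count} entries, {seconds:.0f} seconds, {seconds / 3600:.0f} hours")
```

Streaming the file line by line keeps memory use flat, which matters because the compressed text alone is several hundred megabytes.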

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: ParlaSpeech-HR.v2.0.jsonl.gz
Size: 362.17 MB
Format: application/gzip
Description: Corpus text in gzipped JSON Lines format
MD5: bfdad5b7a3fc1a5f42e2e00b6fdd999f

Name: ParlaSpeech-HR.v2.0.part1.tgz
Size: 30.48 GB
Format: Unknown
Description: Speech in FLAC format, part 1
MD5: 065b28dab675a9fa7b96e4aa2f37418b

Name: ParlaSpeech-HR.v2.0.part2.tgz
Size: 42.37 GB
Format: Unknown
Description: Speech in FLAC format, part 2
MD5: 53a37542cfe6e860eefee48caf180d66

Name: ParlaSpeech-HR.v2.0.part3.tgz
Size: 37.61 GB
Format: Unknown
Description: Speech in FLAC format, part 3
MD5: e41cc3aa0d8b54c82b3250021ed4bf88

Name: ParlaSpeech-HR.v2.0.part4.tgz
Size: 41.48 GB
Format: Unknown
Description: Speech in FLAC format, part 4
MD5: 5b618ca214c3f846f4d1d46386253c18

Name: ParlaSpeech-HR.v2.0.part5.tgz
Size: 50.13 GB
Format: Unknown
Description: Speech in FLAC format, part 5
MD5: 9cbc3155cde96d8b9e0359745820febc

Name: ParlaSpeech-HR.v2.0.part6.tgz
Size: 4.91 GB
Format: Unknown
Description: Speech in FLAC format, part 6
MD5: ea859bacdbbb236c5b13f4bba6a4122f

Name: README.txt
Size: 1023 bytes
Format: Text file
Description: Description of the corpus format
MD5: 7baa432c16d1480a961fd52ab5a95e97

File preview

ASR training dataset for Croatian ParlaSpeech-HR v2.0
http://hdl.handle.net/11356/1914

The ParlaSpeech-HR.v2.0.jsonl (JSON lines) file consists of entries with the following attributes:

id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: List of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) from th . . .
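
The `words` attribute is described as useful for further segmentation of each entry. A minimal sketch of one such segmentation follows: a greedy split that groups consecutive words into chunks no longer than a chosen duration. The README preview does not show the exact layout of `words`, so the function takes plain `(start_ms, end_ms)` pairs as input; that layout, the function name `split_by_duration`, and the 1000 ms limit in the example are assumptions for illustration, and extracting the pairs from a real entry is left out.

```python
def split_by_duration(word_spans, max_ms):
    """Greedily group consecutive word spans into chunks of at most max_ms.

    word_spans: list of (start_ms, end_ms) pairs, one per word, in order.
    Returns a list of (first_index, last_index) pairs, both inclusive.
    A single word longer than max_ms is kept as its own chunk.
    """
    chunks = []
    first = 0  # index of the first word in the chunk being built
    for i, (_, end) in enumerate(word_spans):
        chunk_start = word_spans[first][0]
        # Close the current chunk if adding word i would exceed the limit.
        if i > first and end - chunk_start > max_ms:
            chunks.append((first, i - 1))
            first = i
    if word_spans:
        chunks.append((first, len(word_spans) - 1))
    return chunks


if __name__ == "__main__":
    # Hypothetical word timings in milliseconds, not taken from the corpus.
    spans = [(0, 400), (450, 900), (1000, 1600), (1700, 2100), (2200, 3000)]
    print(split_by_duration(spans, max_ms=1000))
    # prints [(0, 1), (2, 2), (3, 3), (4, 4)]
```

Splitting only at word boundaries keeps each shorter segment aligned with a contiguous piece of the transcript, so the matching text can be recovered from the character offsets.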
