ASR training dataset for Croatian ParlaSpeech-HR v1.0

Name: ASR training dataset for Croatian ParlaSpeech-HR v1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Koržinek, Danijel; Rupnik, Peter; Jazbec, Ivo-Pavao; Batanović, Vuk; Bajčetić, Lenka; Evkoski, Bojan

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Koržinek, Danijel
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Jazbec, Ivo-Pavao
dc.contributor.author	Batanović, Vuk
dc.contributor.author	Bajčetić, Lenka
dc.contributor.author	Evkoski, Bojan
dc.date.accessioned	2022-04-13T14:30:52Z
dc.date.available	2022-04-13T14:30:52Z
dc.date.issued	2022-04-04
dc.identifier.uri	http://hdl.handle.net/11356/1494
dc.description	The ParlaSpeech-HR dataset is built from the parliamentary proceedings in the Croatian part of the ParlaMint corpus and the parliamentary recordings on the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. Two transcripts are available: the original one, and one normalised with a simple rule-based normaliser. Each transcript contains word-level alignments to the recordings. Each segment references the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) via utterance IDs. If a segment is based on a single utterance, speaker information for that segment is available as well. Speaker information is available for 381,849 segments, i.e., 95% of all segments, and comprises everything ParlaMint 2.1 records about a speaker (name, party, gender, age, status, role). There are altogether 309 speakers in the dataset. The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, with the goal of not losing speaker variety on dev data. Test data consist of 513 segments that come from 3 male (258 segments) and 3 female speakers (255 segments). There are no segments coming from the 6 test speakers in the two remaining subsets. The 22,076 instances without speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set.
dc.language.iso	hrv
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://aclanthology.org/2022.parlaclarin-1.16
dc.relation.isreplacedby	http://hdl.handle.net/11356/1914
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
dc.subject	parliamentary debates
dc.subject	speech recordings
dc.subject	speech database
dc.subject	speech recognition
dc.subject	automatic speech recognition
dc.subject	speech transcription
dc.title	ASR training dataset for Croatian ParlaSpeech-HR v1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	audio
hidden	hidden
has.files	yes
branding	CLARIN.SI data & tools
demo.uri	https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other
sponsor	ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info	403925 entries
size.info	6538823 seconds
size.info	1816 hours
files.count	5
files.size	125891165699
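
The counts quoted in the description and the size.info fields can be cross-checked with a little arithmetic. The sketch below only restates figures from the record above; the function name `summarize` and the derived mean segment length are illustrative additions, not part of the dataset.

```python
# Figures copied from the item record above.
TOTAL_ENTRIES = 403_925          # size.info: entries
TOTAL_SECONDS = 6_538_823        # size.info: seconds
WITH_SPEAKER_INFO = 381_849      # segments with speaker information
WITHOUT_SPEAKER_INFO = 22_076    # segments not assigned to any subset
TRAIN, DEV, TEST = 380_836, 500, 513
TEST_MALE, TEST_FEMALE = 258, 255


def summarize():
    """Return consistency checks and derived statistics (illustrative helper)."""
    return {
        # Segments with and without speaker info should cover all entries.
        "entries_add_up": WITH_SPEAKER_INFO + WITHOUT_SPEAKER_INFO == TOTAL_ENTRIES,
        # The three subsets should cover all segments with speaker info.
        "subsets_add_up": TRAIN + DEV + TEST == WITH_SPEAKER_INFO,
        # Male and female test segments should cover the test set.
        "test_add_up": TEST_MALE + TEST_FEMALE == TEST,
        "speaker_info_share": WITH_SPEAKER_INFO / TOTAL_ENTRIES,
        "hours": TOTAL_SECONDS / 3600,
        "mean_segment_seconds": TOTAL_SECONDS / TOTAL_ENTRIES,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

The share of segments with speaker information comes out just under the 95% stated in the description, the duration matches the listed 1816 hours, and the mean segment length falls inside the stated 8-20 second range.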

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: ParlaSpeech-HR.v1.0.jsonl
Size: 679.52 MB
Format: Unknown
Description: Corpus in JSON Lines format
MD5: 271ef6589623facd86527b1e05b740f4

Name: ParlaSpeech-HR.v1.0.txt
Size: 1.07 KB
Format: Text file
Description: README
MD5: 71a7479a87e107510c99bc2602e1076e

File Preview

ASR training dataset for Croatian ParlaSpeech-HR v1.0
http://hdl.handle.net/11356/1494

The ParlaSpeech-HR.v1.0.jsonl (json lines) file consists of entries with the following attributes:

path: name of the file with the segment recording
orig_file: name of the original file harvested from YouTube
start: second when the segment starts in the original file
end: second when the segment ends in the original file
words: list of words from the original transcript
word_start_times: relative time references (in seconds) to each word
norm_words: list of words normalized with an imperfect rule-based normaliser
norm_words_start_times: relative time references (in seconds) to each word in the normalized transcript
utterance_id_start: ID of the utterance in the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) where the segment starts
utterance_id_end: ID of the utterance in the ParlaMint 2.1 corpus where the segment ends
speaker_info: list of speaker attributes from ParlaMint 2.1, if single . . .
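
A minimal sketch of reading entries with the attributes listed above, using only the Python standard library. It assumes that each line of the file is one JSON object, that `words` and `word_start_times` are parallel lists of equal length, and that `start` and `end` are numeric; none of this is confirmed beyond the README preview. The record in `SAMPLE` is invented for illustration and does not come from the corpus, and the helper names are hypothetical.

```python
import io
import json


def iter_entries(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def timed_words(entry, normalized=False):
    """Pair each word with its start time (assumes parallel lists)."""
    words_key = "norm_words" if normalized else "words"
    times_key = "norm_words_start_times" if normalized else "word_start_times"
    return list(zip(entry[words_key], entry[times_key]))


def duration(entry):
    """Segment length in seconds, from its offsets in the original file."""
    return entry["end"] - entry["start"]


# Invented record, for illustration only; field names follow the README.
SAMPLE = json.dumps({
    "path": "example.flac",
    "orig_file": "example_source",
    "start": 10.0,
    "end": 22.5,
    "words": ["Hvala", "lijepa"],
    "word_start_times": [0.0, 0.6],
    "norm_words": ["hvala", "lijepa"],
    "norm_words_start_times": [0.0, 0.6],
})

if __name__ == "__main__":
    for entry in iter_entries(io.StringIO(SAMPLE + "\n")):
        print(entry["path"], duration(entry), timed_words(entry))
```

To read the real file, pass an open file handle instead of the `io.StringIO` object, for example `open("ParlaSpeech-HR.v1.0.jsonl", encoding="utf-8")`. Iterating line by line avoids loading the whole 679.52 MB file into memory.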

Name: ParlaSpeech-HR.flac.tgz.0
Size: 48.83 GB
Format: Unknown
Description: Speech in FLAC format, slice 0
MD5: 84076b62f51eb1da9870c1f6c4da436b

Name: ParlaSpeech-HR.flac.tgz.1
Size: 48.83 GB
Format: Unknown
Description: Speech in FLAC format, slice 1
MD5: 8123e76721d437837a2439dd662a973b

Name: ParlaSpeech-HR.flac.tgz.2
Size: 18.93 GB
Format: Unknown
Description: Speech in FLAC format, slice 2
MD5: cd8e71d1d93a3b89d10a208c288c824e
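
Given the size of the downloads, it may be worth checking each file against the MD5 values listed above before unpacking. The following is a sketch using Python's `hashlib`; the checksums are copied from this listing, while the helper names `md5_of` and `verify` are hypothetical. How the three FLAC slices are meant to be recombined is not stated in this record, so that step is left out.

```python
import hashlib
import os

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "ParlaSpeech-HR.v1.0.jsonl": "271ef6589623facd86527b1e05b740f4",
    "ParlaSpeech-HR.v1.0.txt": "71a7479a87e107510c99bc2602e1076e",
    "ParlaSpeech-HR.flac.tgz.0": "84076b62f51eb1da9870c1f6c4da436b",
    "ParlaSpeech-HR.flac.tgz.1": "8123e76721d437837a2439dd662a973b",
    "ParlaSpeech-HR.flac.tgz.2": "cd8e71d1d93a3b89d10a208c288c824e",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's digest matches the listed value for its name."""
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of(path) == expected


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if os.path.exists(name):
            print(name, "OK" if verify(name) else "MISMATCH")
```

Reading in chunks keeps memory use flat even for the 48.83 GB slices.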
