dc.contributor.author Ljubešić, Nikola
dc.contributor.author Koržinek, Danijel
dc.contributor.author Rupnik, Peter
dc.contributor.author Jazbec, Ivo-Pavao
dc.contributor.author Batanović, Vuk
dc.contributor.author Bajčetić, Lenka
dc.contributor.author Evkoski, Bojan
dc.date.accessioned 2022-04-13T14:30:52Z
dc.date.available 2022-04-13T14:30:52Z
dc.date.issued 2022-04-04
dc.identifier.uri http://hdl.handle.net/11356/1494
dc.description The ParlaSpeech-HR dataset is built from the parliamentary proceedings available in the Croatian part of the ParlaMint corpus and the parliamentary recordings available on the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. Two transcripts are available: the original one, and one normalised with a simple rule-based normaliser. Each transcript contains word-level alignments to the recordings. Each segment references the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) via utterance IDs. If a segment is based on a single utterance, speaker information for that segment is available as well. Speaker information is available for 381,849 segments, i.e., 95% of all segments, and comprises all the speaker attributes available in the ParlaMint 2.1 corpus (name, party, gender, age, status, role). The dataset contains 309 speakers in total. The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, with the goal of not losing speaker variety on dev data. Test data consist of 513 segments that come from 3 male speakers (258 segments) and 3 female speakers (255 segments). No segments from the 6 test speakers appear in the two remaining subsets. The 22,076 instances without speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set.
dc.language.iso hrv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby https://aclanthology.org/2022.parlaclarin-1.16
dc.relation.isreplacedby http://hdl.handle.net/11356/1914
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora
dc.subject parliamentary debates
dc.subject speech recordings
dc.subject speech database
dc.subject speech recognition
dc.subject automatic speech recognition
dc.subject speech transcription
dc.title ASR training dataset for Croatian ParlaSpeech-HR v1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType audio
hidden hidden
has.files yes
branding CLARIN.SI data & tools
demo.uri https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
size.info 403925 entries
size.info 6538823 seconds
size.info 1816 hours
files.count 5
files.size 125891165699
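
The counts quoted in dc.description and the size.info fields can be cross-checked with a few lines of arithmetic. The sketch below only restates figures from this record; the constant names and the `summarize` helper are illustrative and not part of the dataset.

```python
# Figures copied from the metadata record above.
TOTAL_SEGMENTS = 403_925        # size.info: entries
WITH_SPEAKER_INFO = 381_849     # segments with speaker information
WITHOUT_SPEAKER_INFO = 22_076   # segments not assigned to any subset
DEV_SEGMENTS = 500
TEST_SEGMENTS = 513
TRAIN_SEGMENTS = 380_836
TOTAL_SECONDS = 6_538_823       # size.info: seconds


def summarize():
    """Recompute the derived figures from the raw counts."""
    return {
        # Segments with and without speaker info should cover the corpus.
        "segments_add_up": WITH_SPEAKER_INFO + WITHOUT_SPEAKER_INFO == TOTAL_SEGMENTS,
        # Training set = segments with speaker info minus dev and test.
        "train_size": WITH_SPEAKER_INFO - DEV_SEGMENTS - TEST_SEGMENTS,
        # Share of segments with speaker info, in percent.
        "speaker_info_share": round(100 * WITH_SPEAKER_INFO / TOTAL_SEGMENTS, 1),
        # Whole hours of audio.
        "hours": TOTAL_SECONDS // 3600,
        # Mean segment length in seconds.
        "mean_segment_seconds": round(TOTAL_SECONDS / TOTAL_SEGMENTS, 1),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

The recomputed training size matches the stated 380,836 instances, and the mean segment length falls inside the stated 8-20 second range.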


Files in this item

Name: ParlaSpeech-HR.v1.0.jsonl
Size: 679.52 MB
Format: Unknown
Description: Corpus in JSON Lines format
MD5: 271ef6589623facd86527b1e05b740f4

Name: ParlaSpeech-HR.v1.0.txt
Size: 1.07 KB
Format: Text file
Description: README
MD5: 71a7479a87e107510c99bc2602e1076e
File preview:
ASR training dataset for Croatian ParlaSpeech-HR v1.0
http://hdl.handle.net/11356/1494

The ParlaSpeech-HR.v1.0.jsonl (json lines) file consists of entries with the following attributes:

path: name of the file with the segment recording
orig_file: name of the original file harvested from YouTube
start: second when the segment starts in the original file
end: second when the segment ends in the original file
words: list of words from the original transcript
word_start_times: relative time references (in seconds) to each word
norm_words: list of words normalized with an imperfect rule-based normaliser
norm_words_start_times: relative time references (in seconds) to each word in the normalized transcript
utterance_id_start: ID of the utterance in the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) where the segment starts
utterance_id_end: ID of the utterance in the ParlaMint 2.1 corpus where the segment ends
speaker_info: list of speaker attributes from ParlaMint 2.1, if single . . .
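
The attributes listed above can be read with the standard library alone. The following is a minimal sketch, not an official loader: it assumes that each line of the file is one JSON object, that `words` and `word_start_times` (and their `norm_` counterparts) are parallel lists of equal length, and that `start` and `end` are numeric. The helper names `iter_entries`, `aligned_words` and `segment_duration` are hypothetical.

```python
import json


def iter_entries(lines):
    """Yield one dict per non-empty JSON line.

    `lines` is any iterable of strings, e.g. an open file handle.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def aligned_words(entry, normalised=False):
    """Pair each word with its start time relative to the segment.

    Assumes the word list and the time list have the same length.
    """
    words_key = "norm_words" if normalised else "words"
    times_key = "norm_words_start_times" if normalised else "word_start_times"
    return list(zip(entry[words_key], entry[times_key]))


def segment_duration(entry):
    """Length of the segment in seconds, from its position in the original file."""
    return entry["end"] - entry["start"]
```

To run it over the corpus file, open `ParlaSpeech-HR.v1.0.jsonl` with `encoding="utf-8"` and pass the file handle to `iter_entries`; the file is then streamed line by line rather than loaded into memory at once.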
                                            
Name: ParlaSpeech-HR.flac.tgz.0
Size: 48.83 GB
Format: Unknown
Description: Speech in FLAC format, slice 0
MD5: 84076b62f51eb1da9870c1f6c4da436b

Name: ParlaSpeech-HR.flac.tgz.1
Size: 48.83 GB
Format: Unknown
Description: Speech in FLAC format, slice 1
MD5: 8123e76721d437837a2439dd662a973b

Name: ParlaSpeech-HR.flac.tgz.2
Size: 18.93 GB
Format: Unknown
Description: Speech in FLAC format, slice 2
MD5: cd8e71d1d93a3b89d10a208c288c824e
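
Since every file in this item is listed with an MD5 checksum, downloads can be verified before use. The sketch below is one possible way to do that with Python's `hashlib`. The checksums are copied from the listing above; the helper names `md5_of` and `verify`, and the assumption that all files sit in a single local directory under their original names, are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this item.
EXPECTED_MD5 = {
    "ParlaSpeech-HR.v1.0.jsonl": "271ef6589623facd86527b1e05b740f4",
    "ParlaSpeech-HR.v1.0.txt": "71a7479a87e107510c99bc2602e1076e",
    "ParlaSpeech-HR.flac.tgz.0": "84076b62f51eb1da9870c1f6c4da436b",
    "ParlaSpeech-HR.flac.tgz.1": "8123e76721d437837a2439dd662a973b",
    "ParlaSpeech-HR.flac.tgz.2": "cd8e71d1d93a3b89d10a208c288c824e",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its checksum matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.exists() and md5_of(path) == expected
    return results
```

Reading in chunks keeps memory use flat, which matters for the two 48.83 GB slices. Calling `verify()` in the download directory reports a missing or corrupted file as `False`.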
