dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Koržinek, Danijel |
dc.contributor.author | Rupnik, Peter |
dc.contributor.author | Jazbec, Ivo-Pavao |
dc.contributor.author | Batanović, Vuk |
dc.contributor.author | Bajčetić, Lenka |
dc.contributor.author | Evkoski, Bojan |
dc.date.accessioned | 2022-04-13T14:30:52Z |
dc.date.available | 2022-04-13T14:30:52Z |
dc.date.issued | 2022-04-04 |
dc.identifier.uri | http://hdl.handle.net/11356/1494 |
dc.description | The ParlaSpeech-HR dataset is built from parliamentary proceedings available in the Croatian part of the ParlaMint corpus and the parliamentary recordings available from the Croatian Parliament's YouTube channel. The corpus consists of segments 8-20 seconds in length. Two transcripts are available: the original one, and one normalised via a simple rule-based normaliser. Each transcript contains word-level alignments to the recordings. Each segment has a reference to the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) via utterance IDs. If a segment is based on a single utterance, speaker information for that segment is available as well. Speaker information is available for 381,849 segments, i.e., 95% of all segments. It consists of all the speaker information available from the ParlaMint 2.1 corpus (name, party, gender, age, status, role). There are 309 speakers in the dataset altogether. The dataset is divided into a training, a development, and a testing subset. Development data consist of 500 segments coming from the 5 most frequent speakers, with the goal of not losing speaker variety on dev data. Test data consist of 513 segments that come from 3 male (258 segments) and 3 female speakers (255 segments). There are no segments coming from the 6 test speakers in the two remaining subsets. The 22,076 instances not having speaker information are not assigned to any of the three subsets. The remaining 380,836 instances form the training set. |
dc.language.iso | hrv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://aclanthology.org/2022.parlaclarin-1.16 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1914 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora |
dc.subject | parliamentary debates |
dc.subject | speech recordings |
dc.subject | speech database |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.subject | speech transcription |
dc.title | ASR training dataset for Croatian ParlaSpeech-HR v1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
size.info | 403925 entries |
size.info | 6538823 seconds |
size.info | 1816 hours |
files.count | 5 |
files.size | 125891165699 |
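The counts in the description and the size.info fields can be cross-checked with a few lines of arithmetic. The sketch below only restates figures from this record; the constant and function names are illustrative and not part of the dataset.

```python
# Figures copied from the record above; names are illustrative only.
TOTAL_ENTRIES = 403_925         # size.info: entries
TOTAL_SECONDS = 6_538_823       # size.info: seconds
WITH_SPEAKER_INFO = 381_849     # segments with speaker information
WITHOUT_SPEAKER_INFO = 22_076   # segments not assigned to any subset
TRAIN_SEGMENTS = 380_836
DEV_SEGMENTS = 500
TEST_SEGMENTS = 258 + 255       # male + female test segments


def summarize():
    """Derive the headline figures from the raw counts."""
    return {
        "hours": TOTAL_SECONDS / 3600,
        "mean_segment_seconds": TOTAL_SECONDS / TOTAL_ENTRIES,
        "speaker_info_share": WITH_SPEAKER_INFO / TOTAL_ENTRIES,
        "subset_total": TRAIN_SEGMENTS + DEV_SEGMENTS + TEST_SEGMENTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.4f}")
```

Under these figures the three subsets together account for exactly the segments that carry speaker information, and the mean segment length falls inside the stated 8-20 second range.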
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name
- ParlaSpeech-HR.v1.0.jsonl
- Size
- 679.52 MB
- Format
- Unknown
- Description
- Corpus in JSON Lines format
- MD5
- 271ef6589623facd86527b1e05b740f4
- Name
- ParlaSpeech-HR.v1.0.txt
- Size
- 1.07 KB
- Format
- Text file
- Description
- README
- MD5
- 71a7479a87e107510c99bc2602e1076e
ASR training dataset for Croatian ParlaSpeech-HR v1.0
http://hdl.handle.net/11356/1494

The ParlaSpeech-HR.v1.0.jsonl (json lines) file consists of entries with the following attributes:
path: name of the file with the segment recording
orig_file: name of the original file harvested from YouTube
start: second when the segment starts in the original file
end: second when the segment ends in the original file
words: list of words from the original transcript
word_start_times: relative time references (in seconds) to each word
norm_words: list of words normalized with an imperfect rule-based normaliser
norm_words_start_times: relative time references (in seconds) to each word in the normalized transcript
utterance_id_start: ID of the utterance in the ParlaMint 2.1 corpus (http://hdl.handle.net/11356/1432) where the segment starts
utterance_id_end: ID of the utterance in the ParlaMint 2.1 corpus where the segment ends
speaker_info: list of speaker attributes from ParlaMint 2.1, if single . . .
- Name
- ParlaSpeech-HR.flac.tgz.0
- Size
- 48.83 GB
- Format
- Unknown
- Description
- Speech in FLAC format, slice 0
- MD5
- 84076b62f51eb1da9870c1f6c4da436b
- Name
- ParlaSpeech-HR.flac.tgz.1
- Size
- 48.83 GB
- Format
- Unknown
- Description
- Speech in FLAC format, slice 1
- MD5
- 8123e76721d437837a2439dd662a973b
- Name
- ParlaSpeech-HR.flac.tgz.2
- Size
- 18.93 GB
- Format
- Unknown
- Description
- Speech in FLAC format, slice 2
- MD5
- cd8e71d1d93a3b89d10a208c288c824e
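After downloading, the files can be checked against the MD5 sums listed above and the JSON Lines corpus read entry by entry. The following is a minimal sketch using only the Python standard library. The helper names (`md5_of`, `iter_entries`, `segment_duration`) are hypothetical, and it assumes that `start` and `end` are stored as numbers of seconds, as the README preview suggests.

```python
import hashlib
import json

# MD5 sums copied from the file listing in this record.
EXPECTED_MD5 = {
    "ParlaSpeech-HR.v1.0.jsonl": "271ef6589623facd86527b1e05b740f4",
    "ParlaSpeech-HR.v1.0.txt": "71a7479a87e107510c99bc2602e1076e",
    "ParlaSpeech-HR.flac.tgz.0": "84076b62f51eb1da9870c1f6c4da436b",
    "ParlaSpeech-HR.flac.tgz.1": "8123e76721d437837a2439dd662a973b",
    "ParlaSpeech-HR.flac.tgz.2": "cd8e71d1d93a3b89d10a208c288c824e",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte slices never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_entries(jsonl_path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def segment_duration(entry):
    """Segment length in seconds (assumes numeric 'start' and 'end')."""
    return entry["end"] - entry["start"]
```

A download would then be accepted when `md5_of(name) == EXPECTED_MD5[name]` holds for each file. How the three `.tgz` slices are meant to be unpacked is not stated in this record, so that step is left out here.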