Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0

Name: Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Authors: Ljubešić, Nikola; Rupnik, Peter; Koržinek, Danijel

Item identifier: http://hdl.handle.net/11356/1834

Project URL: https://www.clarin.eu/parlamint

Demo URL: https://huggingface.co/datasets/classla/ParlaSpeech-RS

Referenced by: https://doi.org/10.1007/978-3-031-77961-9_10

Date issued: 2024-02-08

Type: audio, corpus

Size: 290778 entries, 3226388 seconds, 896 hours

Language(s): Serbian

Description: The ParlaSpeech-RS dataset is built from the transcripts of parliamentary proceedings available in the Serbian part of the ParlaMint (ParlaMint-RS) corpus, and the parliamentary recordings available from the Serbian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
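
The further segmentation mentioned in the description can be sketched as a greedy grouping of consecutive words under a duration limit. This is only an illustration: the exact layout of each word entry is not given here, so the keys "start_ms" and "end_ms" below are hypothetical names for the millisecond offsets, and the sample values are made up.

```python
from typing import Dict, List

def split_by_duration(words: List[Dict[str, int]], max_ms: int) -> List[List[Dict[str, int]]]:
    """Greedily group consecutive words into chunks spanning at most max_ms.

    Assumes each word is a dict with hypothetical keys "start_ms" and "end_ms".
    A single word longer than max_ms is kept as a chunk of its own.
    """
    chunks: List[List[Dict[str, int]]] = []
    current: List[Dict[str, int]] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word["end_ms"] - current[0]["start_ms"] > max_ms:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks

# Made-up alignment data, for illustration only.
sample_words = [
    {"start_ms": 0, "end_ms": 400},
    {"start_ms": 500, "end_ms": 900},
    {"start_ms": 1000, "end_ms": 1600},
    {"start_ms": 1700, "end_ms": 2100},
]
sample_chunks = split_by_duration(sample_words, max_ms=1000)
# Time span covered by each chunk, as (start, end) in milliseconds.
sample_spans = [(c[0]["start_ms"], c[-1]["end_ms"]) for c in sample_chunks]
```

Cutting only at word boundaries keeps each shorter segment aligned with a contiguous piece of the transcript, which is what the word-level alignments make possible.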

Publisher: Jožef Stefan Institute

Subject(s): parliamentary debates; speech recordings; speech database; speech recognition; automatic speech recognition; speech transcription

Collection(s): CLARIN.SI data & tools

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: ParlaSpeech-RS.v1.0.jsonl.gz
Size: 102.73 MB
Format: application/gzip
Description: Corpus text in gzipped JSON Lines format
MD5: 4b83f759fabd6d0dcb1bf391090b2143

Name: ParlaSpeech-RS.v1.0.part1.tgz
Size: 36.41 GB
Format: Unknown
Description: Speech in FLAC format, part 1
MD5: 83ff0608114a8c2701f712112ce88f03

Name: ParlaSpeech-RS.v1.0.part2.tgz
Size: 26.62 GB
Format: Unknown
Description: Speech in FLAC format, part 2
MD5: 628efb94708a9e10d02fd825ac853a4c

Name: README.txt
Size: 1 KB
Format: Text file
Description: Description of the corpus format
MD5: dc33d4dd9eb8d6b8a29a28fd1ed309cf
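
Since the archives are tens of gigabytes, it may be worth checking a download against the MD5 values listed above before unpacking. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the files are assumed to sit under their listed names wherever you saved them.

```python
import hashlib

# MD5 values copied from the file list above.
PUBLISHED_MD5 = {
    "ParlaSpeech-RS.v1.0.jsonl.gz": "4b83f759fabd6d0dcb1bf391090b2143",
    "ParlaSpeech-RS.v1.0.part1.tgz": "83ff0608114a8c2701f712112ce88f03",
    "ParlaSpeech-RS.v1.0.part2.tgz": "628efb94708a9e10d02fd825ac853a4c",
    "README.txt": "dc33d4dd9eb8d6b8a29a28fd1ed309cf",
}

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, verify("ParlaSpeech-RS.v1.0.part1.tgz", PUBLISHED_MD5["ParlaSpeech-RS.v1.0.part1.tgz"]) should return True for an intact download.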

File Preview

Parliamentary spoken corpus of Serbian ParlaSpeech-RS v1.0
http://hdl.handle.net/11356/1834

The ParlaSpeech-RS.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:

id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fr . . .
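
The attributes above can be read straight from the gzipped JSON Lines file. This is a minimal sketch, not an official loader: the function names are illustrative, and it relies only on the "audio_length" attribute being given in seconds, as described above.

```python
import gzip
import json
from typing import Dict, Iterator

def read_entries(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

def total_hours(path: str) -> float:
    """Sum the audio_length attribute (seconds) over all entries, in hours."""
    return sum(entry["audio_length"] for entry in read_entries(path)) / 3600
```

Reading the file line by line avoids loading the whole corpus text into memory. Calling total_hours("ParlaSpeech-RS.v1.0.jsonl.gz") on the downloaded file gives a quick sanity check against the size reported for the corpus.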
