Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0

License: https://creativecommons.org/licenses/by-sa/4.0/

Authors: Koržinek, Danijel and Ljubešić, Nikola

Item identifier: http://hdl.handle.net/11356/1686

Project URL: https://www.clarin.eu/parlamint

Demo URL: https://huggingface.co/datasets/classla/ParlaSpeech-PL

Referenced by: https://doi.org/10.1007/978-3-031-77961-9_10

Date issued: 2024-02-01

Type: audio, corpus

Size: 535465 entries, 3635354 seconds, 1010 hours

Language(s): Polish

Description: The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus, and the parliamentary recordings available from the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key.
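The further segmentation mentioned in the description could look roughly like the sketch below. It assumes the word-level timings have already been extracted into a list of (start_ms, end_ms) pairs; the exact layout of each item in the corpus "words" list is not shown here, so that extraction step and the names group_words and max_ms are hypothetical.

```python
from typing import List, Tuple


def group_words(word_times_ms: List[Tuple[int, int]], max_ms: int) -> List[Tuple[int, int]]:
    """Greedily group consecutive words into chunks spanning at most max_ms.

    word_times_ms holds (start_ms, end_ms) per word (assumed layout).
    Returns (first_word_index, last_word_index) pairs, both inclusive.
    A single word longer than max_ms is kept as its own chunk.
    """
    chunks: List[Tuple[int, int]] = []
    start_idx = 0
    for i, (_, end_ms) in enumerate(word_times_ms):
        chunk_start_ms = word_times_ms[start_idx][0]
        # Close the current chunk when adding word i would exceed the limit.
        if end_ms - chunk_start_ms > max_ms and i > start_idx:
            chunks.append((start_idx, i - 1))
            start_idx = i
    if word_times_ms:
        chunks.append((start_idx, len(word_times_ms) - 1))
    return chunks
```

The returned index pairs can then be used to cut both the text (via the character offsets) and the audio (via the millisecond offsets) at the same word boundaries.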

Publisher: Jožef Stefan Institute

Subject(s): parliamentary debates, speech recordings, speech database, speech recognition, automatic speech recognition, speech transcription

Collection(s): CLARIN.SI data & tools

Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Name: ParlaSpeech-PL.v1.0.jsonl.gz
Size: 124.96 MB
Format: application/gzip
Description: Corpus text in gzipped JSON Lines format
MD5: a186d99cf96f15be6898cf86bc261f34

Name: ParlaSpeech-PL.v1.0.part1.tgz
Size: 27.87 GB
Format: Unknown
Description: Speech in FLAC format, part 1
MD5: cbef0242706ee876bd27e7e151c69ba2

Name: ParlaSpeech-PL.v1.0.part2.tgz
Size: 30.74 GB
Format: Unknown
Description: Speech in FLAC format, part 2
MD5: 95332c745a4c79a56dcc78bc34b30cb1

Name: README.txt
Size: 1 KB
Format: Text file
Description: Description of the corpus format
MD5: 53d3b9c770e2ed6f4cbff71b6d4f267e
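
Given the size of the archives, it may be worth checking downloads against the MD5 values listed above. The following is a minimal sketch using only the Python standard library; the function names md5_of_file and verify are illustrative, and the file is assumed to be in the current directory under the name shown in the listing.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes the file has been downloaded locally):
# verify("ParlaSpeech-PL.v1.0.jsonl.gz", "a186d99cf96f15be6898cf86bc261f34")
```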

File Preview

Parliamentary spoken corpus of Polish ParlaSpeech-PL v1.0
http://hdl.handle.net/11356/1686

The ParlaSpeech-PL.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:

id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fro . . .
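
The attributes listed above can be read straight from the gzipped JSON Lines file without unpacking it. The sketch below is one possible way to do so with the Python standard library; the helper names iter_entries and total_hours are illustrative, and the only attribute it relies on is audio_length, taken to be in seconds as stated above.

```python
import gzip
import json
from typing import Iterator


def iter_entries(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def total_hours(path: str) -> float:
    """Sum the audio_length attribute (seconds) over all entries, in hours."""
    return sum(entry["audio_length"] for entry in iter_entries(path)) / 3600.0


# Example call (assumes the file has been downloaded locally):
# print(total_hours("ParlaSpeech-PL.v1.0.jsonl.gz"))
```

If the audio_length values agree with the Size field of the item record (3635354 seconds), the result should come out at roughly 1010 hours.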
