dc.contributor.author | Koržinek, Danijel |
dc.contributor.author | Ljubešić, Nikola |
dc.date.accessioned | 2024-02-02T10:44:22Z |
dc.date.available | 2024-02-02T10:44:22Z |
dc.date.issued | 2024-02-01 |
dc.identifier.uri | http://hdl.handle.net/11356/1686 |
dc.description | The ParlaSpeech-PL dataset is built from the transcripts of parliamentary proceedings available in the Polish part of the ParlaMint corpus and from the parliamentary recordings available on the Polish Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcripts contain word-level alignments to the recordings, which makes it simple to split long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment references the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key. |
dc.language.iso | pol |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | https://doi.org/10.1007/978-3-031-77961-9_10 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.clarin.eu/parlamint |
dc.subject | parliamentary debates |
dc.subject | speech recordings |
dc.subject | speech database |
dc.subject | speech recognition |
dc.subject | automatic speech recognition |
dc.subject | speech transcription |
dc.title | Parliamentary spoken corpus of Polish ParlaSpeech-PL 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | audio |
has.files | yes |
branding | CLARIN.SI data & tools |
demo.uri | https://huggingface.co/datasets/classla/ParlaSpeech-PL |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
size.info | 535465 entries |
size.info | 3635354 seconds |
size.info | 1010 hours |
files.count | 4 |
files.size | 63067921190 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=parlaspeech_pl |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=parlaspeech_pl |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name
- ParlaSpeech-PL.v1.0.jsonl.gz
- Size
- 124.96 MB
- Format
- application/gzip
- Description
- Corpus text in gzipped JSON Lines format
- MD5
- a186d99cf96f15be6898cf86bc261f34

- Name
- ParlaSpeech-PL.v1.0.part1.tgz
- Size
- 27.87 GB
- Format
- Unknown
- Description
- Speech in FLAC format, part 1
- MD5
- cbef0242706ee876bd27e7e151c69ba2

- Name
- ParlaSpeech-PL.v1.0.part2.tgz
- Size
- 30.74 GB
- Format
- Unknown
- Description
- Speech in FLAC format, part 2
- MD5
- 95332c745a4c79a56dcc78bc34b30cb1

- Name
- README.txt
- Size
- 1 KB
- Format
- Text file
- Description
- Description of the corpus format
- MD5
- 53d3b9c770e2ed6f4cbff71b6d4f267e
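
The MD5 values in the listing above can be used to check that the downloads are complete. The following is a minimal sketch, assuming the four files were saved under their listed names in one local directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "ParlaSpeech-PL.v1.0.jsonl.gz": "a186d99cf96f15be6898cf86bc261f34",
    "ParlaSpeech-PL.v1.0.part1.tgz": "cbef0242706ee876bd27e7e151c69ba2",
    "ParlaSpeech-PL.v1.0.part2.tgz": "95332c745a4c79a56dcc78bc34b30cb1",
    "README.txt": "53d3b9c770e2ed6f4cbff71b6d4f267e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-GB archives never sit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each file name to True (match), False (mismatch) or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

Run from the download directory, the script prints one status line per file. A mismatch on either .tgz archive usually indicates an interrupted transfer.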
Parliamentary spoken corpus of Polish ParlaSpeech-PL v1.0
http://hdl.handle.net/11356/1686

The ParlaSpeech-PL.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:

- id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
- words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
- audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
- audio_length: length of the recording in seconds
- text: transcript of the audio
- text_start: starting character position in the original ParlaMint 4.0 utterance
- text_end: ending character position in the original ParlaMint 4.0 utterance
- audio_start: starting millisecond position in the original YouTube video
- audio_end: ending millisecond position in the original YouTube video
- speaker_info: full information on the speaker (and speech) fro . . .
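
The attributes described in the README can be read with the Python standard library alone. The following is a minimal sketch, assuming each line of the gzipped file is one JSON object and that `audio_length` is stored as a number; the function names and the 30-second threshold are illustrative choices, not part of the corpus documentation. The internal structure of `words` is not spelled out in the README excerpt, so the sketch does not touch it.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(entries, max_seconds=30.0):
    """Count entries, total their duration and collect IDs of long segments.

    max_seconds is an arbitrary illustrative cut-off for segments that may
    need further splitting before ASR training.
    """
    count = 0
    total_seconds = 0.0
    long_ids = []
    for entry in entries:
        count += 1
        total_seconds += entry["audio_length"]
        if entry["audio_length"] > max_seconds:
            long_ids.append(entry["id"])
    return {"entries": count, "hours": total_seconds / 3600, "long_ids": long_ids}


if __name__ == "__main__":
    stats = summarize(iter_entries("ParlaSpeech-PL.v1.0.jsonl.gz"))
    print(stats["entries"], "entries,", round(stats["hours"], 1), "hours")
    print(len(stats["long_ids"]), "entries longer than 30 seconds")
```

The entry count and total duration printed by the script can be compared against the size.info values of this record as a sanity check on the download.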