| dc.contributor.author | Ljubešić, Nikola | 
| dc.contributor.author | Rupnik, Peter | 
| dc.contributor.author | Koržinek, Danijel | 
| dc.date.accessioned | 2024-02-08T15:40:33Z | 
| dc.date.available | 2024-02-08T15:40:33Z | 
| dc.date.issued | 2024-02-08 | 
| dc.identifier.uri | http://hdl.handle.net/11356/1834 | 
| dc.description | The ParlaSpeech-RS dataset is built from the transcripts of parliamentary proceedings available in the Serbian part of the ParlaMint (ParlaMint-RS) corpus, and the parliamentary recordings available from the Serbian Parliament's YouTube channel. The corpus consists of audio segments that correspond to specific sentences in the transcripts. The transcript contains word-level alignments to the recordings, allowing for simple further segmentation of long sentences into shorter segments for ASR and other memory-sensitive applications. Each segment has a reference to the ParlaMint 4.0 corpus (http://hdl.handle.net/11356/1859) via utterance IDs and character offsets. All the speaker information from the ParlaMint corpus is available via the "speaker_info" key. | 
| dc.language.iso | srp | 
| dc.publisher | Jožef Stefan Institute | 
| dc.relation.isreferencedby | https://doi.org/10.1007/978-3-031-77961-9_10 | 
| dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) | 
| dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ | 
| dc.rights.label | PUB | 
| dc.source.uri | https://www.clarin.eu/parlamint | 
| dc.subject | parliamentary debates | 
| dc.subject | speech recordings | 
| dc.subject | speech database | 
| dc.subject | speech recognition | 
| dc.subject | automatic speech recognition | 
| dc.subject | speech transcription | 
| dc.title | Parliamentary spoken corpus of Serbian ParlaSpeech-RS 1.0 | 
| dc.type | corpus | 
| metashare.ResourceInfo#ContentInfo.mediaType | audio | 
| has.files | yes | 
| branding | CLARIN.SI data & tools | 
| demo.uri | https://huggingface.co/datasets/classla/ParlaSpeech-RS | 
| contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute | 
| sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds | 
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds | 
| sponsor | CLARIN ERIC - ParlaMint: Towards Comparable Parliamentary Corpora Other | 
| sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds | 
| size.info | 290778 entries | 
| size.info | 3226388 seconds | 
| size.info | 896 hours | 
| files.count | 4 | 
| files.size | 67789449157 | 
| featuredService.noske | search|https://www.clarin.si/ske/#concordance?corpname=parlaspeech_rs | 
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name
- ParlaSpeech-RS.v1.0.jsonl.gz
- Size
- 102.73 MB
- Format
- application/gzip
- Description
- Corpus text in gzipped JSON Lines format
- MD5
- 4b83f759fabd6d0dcb1bf391090b2143

- Name
- ParlaSpeech-RS.v1.0.part1.tgz
- Size
- 36.41 GB
- Format
- Unknown
- Description
- Speech in FLAC format, part 1
- MD5
- 83ff0608114a8c2701f712112ce88f03

- Name
- ParlaSpeech-RS.v1.0.part2.tgz
- Size
- 26.62 GB
- Format
- Unknown
- Description
- Speech in FLAC format, part 2
- MD5
- 628efb94708a9e10d02fd825ac853a4c

- Name
- README.txt
- Size
- 1 KB
- Format
- Text file
- Description
- Description of the corpus format
- MD5
- dc33d4dd9eb8d6b8a29a28fd1ed309cf
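
The MD5 values in the file listing can be used to check that the downloads arrived intact, which is worth doing for archives of this size. The following is a minimal sketch, assuming the four files were saved under their listed names in a single local directory; the function names `md5_of_file` and `verify_downloads` are illustrative and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "ParlaSpeech-RS.v1.0.jsonl.gz": "4b83f759fabd6d0dcb1bf391090b2143",
    "ParlaSpeech-RS.v1.0.part1.tgz": "83ff0608114a8c2701f712112ce88f03",
    "ParlaSpeech-RS.v1.0.part2.tgz": "628efb94708a9e10d02fd825ac853a4c",
    "README.txt": "dc33d4dd9eb8d6b8a29a28fd1ed309cf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte archives fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.exists():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results
```

Calling `verify_downloads(".")` from the download directory returns one entry per file; any `False` value suggests a corrupted or incomplete download.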
Parliamentary spoken corpus of Serbian ParlaSpeech-RS v1.0
http://hdl.handle.net/11356/1834
The ParlaSpeech-RS.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:
id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: List of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files), the folder name corresponding to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fr . . .
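
The attributes listed above can be read with the Python standard library. The following is a minimal sketch, assuming each line of the file is one JSON object with the documented keys; the helper names `iter_entries`, `span_seconds` and `total_hours` are illustrative and do not come from the README. The internal layout of the `words` and `speaker_info` values is not shown in this preview, so the sketch does not touch them.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a JSON Lines file.
    Files ending in .gz are decompressed on the fly."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def span_seconds(entry):
    """Length of the segment in the original YouTube video, in seconds.
    audio_start and audio_end are documented as millisecond positions."""
    return (entry["audio_end"] - entry["audio_start"]) / 1000.0


def total_hours(entries):
    """Sum the documented audio_length values (seconds) and convert to hours."""
    return sum(entry["audio_length"] for entry in entries) / 3600.0
```

For example, `total_hours(iter_entries("ParlaSpeech-RS.v1.0.jsonl.gz"))` streams the corpus once without loading all entries into memory, and the result can be compared against the size information given in the record metadata.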
                                            
