Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name
 - ParlaSpeech-RS.v1.0.jsonl.gz
 - Size
 - 102.73 MB
 - Format
 - application/gzip
 - Description
 - Corpus text in gzipped JSON Lines format
 - MD5
 - 4b83f759fabd6d0dcb1bf391090b2143
 
- Name
 - ParlaSpeech-RS.v1.0.part1.tgz
 - Size
 - 36.41 GB
 - Format
 - Unknown
 - Description
 - Speech in FLAC format, part 1
 - MD5
 - 83ff0608114a8c2701f712112ce88f03
 
- Name
 - ParlaSpeech-RS.v1.0.part2.tgz
 - Size
 - 26.62 GB
 - Format
 - Unknown
 - Description
 - Speech in FLAC format, part 2
 - MD5
 - 628efb94708a9e10d02fd825ac853a4c
 
- Name
 - README.txt
 - Size
 - 1 KB
 - Format
 - Text file
 - Description
 - Description of the corpus format
 - MD5
 - dc33d4dd9eb8d6b8a29a28fd1ed309cf
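
The MD5 values listed above can be used to check a download before unpacking it. The following is a minimal sketch, not part of the corpus distribution: the names `EXPECTED_MD5`, `md5_of_file` and `verify` are hypothetical, and the files are assumed to be stored locally under their listed names. The file is read in chunks because the audio archives are tens of gigabytes in size.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "ParlaSpeech-RS.v1.0.jsonl.gz": "4b83f759fabd6d0dcb1bf391090b2143",
    "ParlaSpeech-RS.v1.0.part1.tgz": "83ff0608114a8c2701f712112ce88f03",
    "ParlaSpeech-RS.v1.0.part2.tgz": "628efb94708a9e10d02fd825ac853a4c",
    "README.txt": "dc33d4dd9eb8d6b8a29a28fd1ed309cf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `verify("README.txt", EXPECTED_MD5["README.txt"])` should return `True` for an intact copy of the README.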
 
Parliamentary spoken corpus of Serbian ParlaSpeech-RS v1.0
http://hdl.handle.net/11356/1834
The ParlaSpeech-RS.v1.0.jsonl (JSON lines) file consists of entries with the following attributes:
id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: list of character and millisecond offsets to specific words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files); the folder name corresponds to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fr . . .
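
As an illustration of the attributes listed above, the sketch below streams entries from the gzipped JSON Lines file and derives a duration from the millisecond offsets. It is not taken from the README: `iter_entries` and `span_seconds` are hypothetical helper names, and `audio_start` and `audio_end` are assumed to be stored as numbers. The internal layout of `words` and `speaker_info` is not assumed here.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def span_seconds(entry):
    """Duration in seconds implied by the millisecond offsets into the video.

    Assumes audio_start and audio_end are numeric millisecond values.
    """
    return (entry["audio_end"] - entry["audio_start"]) / 1000.0
```

A typical use would be `for entry in iter_entries("ParlaSpeech-RS.v1.0.jsonl.gz"): print(entry["id"], span_seconds(entry))`, which reads one entry at a time rather than loading the whole file into memory.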