Files in this item

Name: ParlaSpeech-PL.v1.0.jsonl.gz
Size: 124.96 MB
Format: application/gzip
Description: Corpus text in gzipped JSON Lines format
MD5: a186d99cf96f15be6898cf86bc261f34

Name: ParlaSpeech-PL.v1.0.part1.tgz
Size: 27.87 GB
Format: Unknown
Description: Speech in FLAC format, part 1
MD5: cbef0242706ee876bd27e7e151c69ba2

Name: ParlaSpeech-PL.v1.0.part2.tgz
Size: 30.74 GB
Format: Unknown
Description: Speech in FLAC format, part 2
MD5: 95332c745a4c79a56dcc78bc34b30cb1

Name: README.txt
Size: 1 KB
Format: Text file
Description: Description of the corpus format
MD5: 53d3b9c770e2ed6f4cbff71b6d4f267e
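
Each file above is listed with an MD5 checksum, which can be used to confirm that a download is complete. The following is a minimal sketch, assuming the file has already been saved locally; the helper names `md5_of_file` and `matches_listed_md5` are illustrative and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a file's MD5 with the value from the listing (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_listed_md5("ParlaSpeech-PL.v1.0.jsonl.gz", "a186d99cf96f15be6898cf86bc261f34")` should return `True` for an intact download, assuming the file sits in the current directory. Chunked reading matters here because the speech archives are tens of gigabytes in size.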
File preview
Parliamentary spoken corpus of Polish ParlaSpeech-PL v1.0
http://hdl.handle.net/11356/1686

The ParlaSpeech-PL.v1.0.jsonl (JSON Lines) file consists of entries with the following attributes:

id: ParlaMint utterance ID with zero-based character offsets pointing to the specific part of the utterance
words: list of character and millisecond offsets for individual words in the transcript, especially useful for further segmentation of each entry
audio: path to the FLAC file (available from the part*.tgz files); the folder name corresponds to the YouTube video ID
audio_length: length of the recording in seconds
text: transcript of the audio
text_start: starting character position in the original ParlaMint 4.0 utterance
text_end: ending character position in the original ParlaMint 4.0 utterance
audio_start: starting millisecond position in the original YouTube video
audio_end: ending millisecond position in the original YouTube video
speaker_info: full information on the speaker (and speech) fro . . .
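
The attributes above can be read straight from the gzipped download with the Python standard library. This is a minimal sketch, assuming the file is UTF-8 encoded with one JSON object per line and that `audio_start` and `audio_end` are numeric; the function names `iter_entries` and `segment_duration_ms` are illustrative only.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def segment_duration_ms(entry):
    """Length of the segment within the source video, from its millisecond offsets."""
    return entry["audio_end"] - entry["audio_start"]
```

Iterating with `for entry in iter_entries("ParlaSpeech-PL.v1.0.jsonl.gz")` gives access to `entry["text"]`, `entry["audio"]` and the other attributes one record at a time, so the whole corpus does not have to be held in memory.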