
oai:www.clarin.si:11356/2038 2025-08-11T10:54:30Z hdl_11356_1023 hdl_11356_1024

Dataset for primary stress identification in Croatian and related languages and dialects

Ljubešić, Nikola; Rupnik, Peter; Porupski, Ivan; Robida, Nejc; Potočnjak, Mirna

primary stress, word stress, phonetics, speech database

The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data for identifying primary stress from the speech signal at the level of a single word. Additional test datasets are available in three languages / dialects: Slovenian, the Chakavian dialect of Croatian, and Serbian.

The data is split into four sections based on their provenance:

ParlaStress-HR.jsonl - Croatian train and test datasets, sampled from ParlaSpeech-HR 2.0 (http://hdl.handle.net/11356/1914)
ParlaStress-SR.jsonl - Serbian test dataset, sampled from ParlaSpeech-RS (http://hdl.handle.net/11356/1834)
MićiPrinc-CKM.jsonl - Chakavian test dataset, sampled from the Mići Princ dataset (http://hdl.handle.net/11356/1765)
Artur-SL.jsonl - Slovenian test dataset, sampled from the Artur dataset (http://hdl.handle.net/11356/1776)

All JSONL files have the following attributes:

* id: string
* audio_wav: string, path to the audio file
* audio_start, audio_end: float, start and end times in seconds in the original audio file, useful for calculating sample duration and as a reference to the original audio
* multisyllabic_words: a list of dictionaries, each entry corresponding to one multisyllabic word with stress information, with keys:
    word: string, the word in question
    time_s: float, start of the word in seconds from the start of the recording
    time_e: float, end of the word in seconds from the start of the recording
    syllable_count: int, number of syllables in the word
    stress: a list with a single dictionary (a list for consistency with unstress) describing the stressed vowel, with keys:
        vowel: string, character of the word that is stressed
        time_s: float, vowel start in seconds from the start of the word
        time_e: float, vowel end in seconds from the start of the word
        char_idx: int, index of the stressed character in the word
    unstress: same as stress, but for unstressed vowels
* graphalign_intervals: a list of dictionaries describing the time alignment of individual graphemes / phonemes, with keys:
    label: string, character that is being aligned
    time_s: float, character start in seconds from the start of the word
    time_e: float, character end in seconds from the start of the word

In addition, ParlaStress-HR.jsonl has the attribute "split_speaker", which assigns each instance to either the "train" or the "test" split. These splits ensure that no speaker appears in both the training and the testing section.

2025-05-30
corpus
http://hdl.handle.net/11356/2038
hrv slv srp ckm
https://doi.org/10.48550/arXiv.2505.24571
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
text/plain; charset=utf-8 application/zip application/zip text/plain
downloadable_files_count: 3
Jožef Stefan Institute
https://clarinsi.github.io/parlaspeech/
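
The JSONL attributes described above could be read along the following lines. This is only a minimal sketch based on the attribute list in this record, not code distributed with the dataset: the helper names (parse_jsonl, load_jsonl, split_by_speaker, sample_duration, stressed_vowels) are hypothetical, and the SAMPLE record is invented for illustration, including its id, path, timings and stress position.

```python
import json
from collections import defaultdict


def parse_jsonl(lines):
    """Parse an iterable of text lines, one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def load_jsonl(path):
    """Read a JSONL file such as ParlaStress-HR.jsonl into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return parse_jsonl(handle)


def split_by_speaker(records):
    """Group records by "split_speaker" (described only for ParlaStress-HR.jsonl).

    Records without the attribute end up under the key None.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record.get("split_speaker")].append(record)
    return dict(groups)


def sample_duration(record):
    """Duration of one instance in seconds, from audio_start and audio_end."""
    return record["audio_end"] - record["audio_start"]


def stressed_vowels(record):
    """Yield (word, vowel, char_idx, vowel_duration) for each multisyllabic word."""
    for entry in record["multisyllabic_words"]:
        for vowel in entry["stress"]:  # described as a single-item list
            yield (
                entry["word"],
                vowel["vowel"],
                vowel["char_idx"],
                vowel["time_e"] - vowel["time_s"],
            )


# Invented record that mirrors the documented attributes; all values are
# made up for illustration and do not come from the dataset.
SAMPLE = {
    "id": "hypothetical-0001",
    "audio_wav": "audio/hypothetical-0001.wav",
    "audio_start": 10.0,
    "audio_end": 12.5,
    "split_speaker": "train",
    "multisyllabic_words": [
        {
            "word": "govoriti",
            "time_s": 0.5,
            "time_e": 1.25,
            "syllable_count": 4,
            "stress": [
                {"vowel": "o", "time_s": 0.25, "time_e": 0.5, "char_idx": 3}
            ],
            "unstress": [
                {"vowel": "o", "time_s": 0.0625, "time_e": 0.125, "char_idx": 1}
            ],
        }
    ],
    "graphalign_intervals": [
        {"label": "g", "time_s": 0.0, "time_e": 0.0625}
    ],
}

if __name__ == "__main__":
    print(sample_duration(SAMPLE))
    print(list(stressed_vowels(SAMPLE)))
    print(sorted(split_by_speaker([SAMPLE])))
```

With the real files, load_jsonl("ParlaStress-HR.jsonl") followed by split_by_speaker would give the speaker-disjoint train and test sections. The other three files are test-only and, according to the description, do not carry "split_speaker".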