<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href='static/style.xsl' type='text/xsl'?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-05-21T20:37:49Z</responseDate><request verb="GetRecord" identifier="oai:www.clarin.si:11356/2038" metadataPrefix="oai_dc">http://www.clarin.si/repository/oai/request</request><GetRecord><record><header><identifier>oai:www.clarin.si:11356/2038</identifier><datestamp>2025-08-11T10:54:30Z</datestamp><setSpec>hdl_11356_1023</setSpec><setSpec>hdl_11356_1024</setSpec></header><metadata><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Dataset for primary stress identification in Croatian and related languages and dialects</dc:title>
<dc:creator>Ljubešić, Nikola</dc:creator>
<dc:creator>Rupnik, Peter</dc:creator>
<dc:creator>Porupski, Ivan</dc:creator>
<dc:creator>Robida, Nejc</dc:creator>
<dc:creator>Potočnjak, Mirna</dc:creator>
<dc:subject>primary stress</dc:subject>
<dc:subject>word stress</dc:subject>
<dc:subject>phonetics</dc:subject>
<dc:subject>speech database</dc:subject>
<dc:description>The dataset contains recordings and offset annotations of a sample of the Croatian parliamentary recordings from the corpus ParlaSpeech-HR. It contains training and testing data for identifying primary stress from the speech signal at the level of a single word. Additional test datasets are available for three further languages and dialects: Slovenian, the Chakavian dialect of Croatian, and Serbian.&#xd;
&#xd;
The data is split into four files based on provenance:&#xd;
ParlaStress-HR.jsonl - Croatian train and test datasets, sampled from the ParlaSpeech-HR 2.0 (http://hdl.handle.net/11356/1914)&#xd;
ParlaStress-SR.jsonl - Serbian test dataset, sampled from the ParlaSpeech-RS (http://hdl.handle.net/11356/1834)&#xd;
MićiPrinc-CKM.jsonl - Chakavian test dataset, sampled from the Mići Princ dataset (http://hdl.handle.net/11356/1765)&#xd;
Artur-SL.jsonl - Slovenian test dataset, sampled from the Artur dataset (http://hdl.handle.net/11356/1776)&#xd;
&#xd;
All JSONL files have the following attributes:&#xd;
* id: string&#xd;
* audio_wav: string, path to the audio file&#xd;
* audio_start, audio_end: float, start and end times in seconds within the original audio file; useful for calculating sample duration and for referring back to the original audio&#xd;
* multisyllabic_words: a list of dictionaries, each entry corresponding to one multisyllabic word with stress information, with keys:&#xd;
    word: string, word in question&#xd;
    time_s: float, start of word in seconds from the start of the recording&#xd;
    time_e: float, end of word in seconds from the start of the recording&#xd;
    syllable_count: int, number of syllables in the word&#xd;
    stress: a list with a single dictionary (a list for consistency with unstress) describing the stressed vowel, with keys:&#xd;
        vowel: string, character of the word that is stressed&#xd;
        time_s: float, vowel start in seconds from the start of the word&#xd;
        time_e: float, vowel end in seconds from the start of the word&#xd;
        char_idx: int, index of stressed character in the word&#xd;
    unstress: same as stress, but for unstressed vowels&#xd;
* graphalign_intervals: a list of dictionaries describing time alignment of individual graphemes / phonemes, with keys:&#xd;
    label: string, character that is being aligned&#xd;
    time_s: float, character start in seconds from the start of the word&#xd;
    time_e: float, character end in seconds from the start of the word&#xd;
&#xd;
In addition, ParlaStress-HR.jsonl has the attribute "split_speaker", which assigns each instance to either the "train" or the "test" split. The splits are speaker-disjoint: no speaker appears in both the training and the testing section.</dc:description>
<dc:date>2025-05-30</dc:date>
<dc:type>corpus</dc:type>
<dc:identifier>http://hdl.handle.net/11356/2038</dc:identifier>
<dc:language>hrv</dc:language>
<dc:language>slv</dc:language>
<dc:language>srp</dc:language>
<dc:language>ckm</dc:language>
<dc:relation>https://doi.org/10.48550/arXiv.2505.24571</dc:relation>
<dc:rights>Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)</dc:rights>
<dc:rights>https://creativecommons.org/licenses/by-sa/4.0/</dc:rights>
<dc:rights>PUB</dc:rights>
<dc:format>text/plain; charset=utf-8</dc:format>
<dc:format>application/zip</dc:format>
<dc:format>application/zip</dc:format>
<dc:format>text/plain</dc:format>
<dc:format>downloadable_files_count: 3</dc:format>
<dc:publisher>Jožef Stefan Institute</dc:publisher>
<dc:source>https://clarinsi.github.io/parlaspeech/</dc:source>
</oai_dc:dc>
</metadata></record></GetRecord></OAI-PMH>