2026-05-27T23:10:37Z http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/2209 2026-05-21T10:55:46Z hdl_11356_1023 hdl_11356_1024

Phonetic segmentation and acoustic measurements of spoken Slovenian SloPhonSeg 1.0

Robida, Nejc; Križaj, Janez

spoken corpus; forced alignment; phonetics; acoustic-phonetic measurements

SloPhonSeg 1.0 is a dataset of automatically generated phonetic segmentations and acoustic-phonetic measurements for selected recordings and transcriptions from the spoken corpus Gos 2.1 (http://hdl.handle.net/11356/1863). The resource contains derived data in two complementary formats: Praat TextGrid annotation files with time-aligned segmentations, and TSV tables with acoustic-phonetic measurements. The TextGrid files align the transcriptions with the audio recordings and segment them into utterances, words, syllables, and phones, while the TSV files provide one row per phone-tier interval and include acoustic measurements, phone context, and token-level metadata.

The packaged sample contains 106 recordings and transcriptions. It was selected from recordings in which speakers were marked in the corpus metadata as using standard Slovene, supplemented by additional recordings that were manually confirmed as predominantly standard. The intended selected-speaker sample is gender-balanced, with 66 female and 66 male primary speakers; the packaged recording metadata lists all speakers present in the selected recordings, including additional participants and group/audience identifiers.

The sample is based on the five Gos 2.1 subcorpora, with the following distribution:

(1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), labelled as Gos in the metadata, 22 recordings, 99,328 source-metadata word tokens.
(2) Spoken corpus Gos VideoLectures (http://hdl.handle.net/11356/1444), labelled as GosVL in the metadata, 15 recordings, 52,402 source-metadata word tokens.
(3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), including:
(3a) Artur-J, 49 recordings, 301,830 source-metadata word tokens: interviews and online events, such as conferences, workshops, and educational videos.
(3b) Artur-P, 17 recordings, 32,607 source-metadata word tokens: transcribed speech from the Slovene National Assembly.
(3c) Artur-N, 3 recordings, 5,112 source-metadata word tokens: non-public speech.

The resource provides three parallel versions of the segmentation, differing in the level of phonetic detail: an allophonic phone-level segmentation with 61 phone labels, a diphthong segmentation with 82 phone labels, and a simplified phonemic segmentation with 44 phone labels. Each version contains 106 TextGrid files and 106 measurement TSV files; the aggregate segmentation statistics report 2,951.55 speech minutes and 395,282 aligned word tokens per version.

The segmentations were produced through forced alignment using the Montreal Forced Aligner (MFA), a Slovene acoustic model, and a pronunciation workflow based on OptiLEX. The TextGrid files contain tiers for speaker identifiers, standardised and conversational transcriptions, word identifiers, word and syllable segments, phone segments, and automatically generated prosodic and discourse-related cues. The TSV files report duration, average pitch, pitch trend, formant frequencies, intensity, sonority, automatically computed Voice Onset Time (VOT), centre of gravity (COG), preceding and following phone labels, aligned token identifier, MULTEXT-East morphosyntactic description (MSD), utterance context, audio identifier, and speaker identifier.

The corresponding audio (and, in part, video) files are available under a restricted licence at http://hdl.handle.net/11356/1973.
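As an illustration of how the per-phone TSV tables described above might be used, the following minimal Python sketch averages a duration column per phone label. The column names `phone` and `duration`, the function name, and the tiny in-memory table are assumptions made for this example only; the actual header names and units in the SloPhonSeg files are not stated in this record and should be checked against the data.

```python
import csv
import io
from collections import defaultdict


def mean_duration_by_phone(tsv_file, phone_col="phone", duration_col="duration"):
    """Return {phone label: mean duration} for a tab-separated table with a header row.

    The default column names are hypothetical; pass the real ones from the TSV header.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        try:
            duration = float(row.get(duration_col, ""))
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric duration
        totals[row[phone_col]] += duration
        counts[row[phone_col]] += 1
    return {phone: totals[phone] / counts[phone] for phone in totals}


# Made-up three-row table standing in for one measurement file.
SAMPLE = "phone\tduration\na\t0.10\na\t0.20\nt\t0.05\n"
means = mean_duration_by_phone(io.StringIO(SAMPLE))
```

With a real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, passing the column names found in that file's header.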
2026-05-22
corpus
http://hdl.handle.net/11356/2209
slv
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
text/plain; charset=utf-8
application/zip
application/zip
application/zip
downloadable_files_count: 3
Centre for Language Resources and Technologies, University of Ljubljana
http://mezzanine.um.si/