dc.contributor.author Verdonik, Darinka
dc.contributor.author Bizjak, Andreja
dc.contributor.author Sepesy Maučec, Mirjam
dc.contributor.author Gril, Lucija
dc.contributor.author Dobrišek, Simon
dc.contributor.author Križaj, Janez
dc.contributor.author Strle, Gregor
dc.contributor.author Bajec, Marko
dc.contributor.author Lebar Bajec, Iztok
dc.contributor.author Jelovšek, Tjaša
dc.contributor.author Lokovšek, Jure
dc.contributor.author Trojar, Mitja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Bernjak, Mitja
dc.contributor.author Žganec Gros, Jerneja
dc.contributor.author Čakš, Peter
dc.contributor.author Pucer, Matevž
dc.contributor.author Cvetko, Mitja
dc.contributor.author Pavlič, Jani
dc.contributor.author Zelenik, Marijana
dc.contributor.author Ivanovska, Marija
dc.contributor.author Grm, Klemen
dc.contributor.author Longyka, Jure
dc.contributor.author Mihelič, Aleš
dc.contributor.author Vesnicer, Boštjan
dc.contributor.author Dretnik, Naum
dc.date.accessioned 2023-03-05T08:41:37Z
dc.date.available 2023-03-05T08:41:37Z
dc.date.issued 2023-02-22
dc.identifier.uri http://hdl.handle.net/11356/1772
dc.description Artur 1.0 is a speech database designed for developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes transcriptions only; the audio files are available at http://hdl.handle.net/11356/1776. Transcriptions are available in the original TRS format of the Transcriber 1.5.1 tool, which was used for making the transcriptions. All transcriptions were made manually or were manually corrected.

The data are structured as follows:

(1) Artur-B, read speech, 573 hours in total. It includes:
  (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural or the actual distribution of triphones in the words. They were distributed between 1,000 speakers, so that approximately 30 minutes of read speech were recorded from each speaker. The speakers were balanced by gender, age, and region; a small proportion were non-native speakers of Slovene. Each sentence is its own transcription file and has a corresponding audio file.
  (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations, personal names, and surnames, chosen so that all Slovene letters were covered, plus the most common foreign letters. The transcriptions were corrected manually.
  (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own transcription file and has a corresponding recording.
  (1d) Artur-B-Izloceno, 27 hours: In TRS format only. The recordings that correspond to these transcriptions include different types of errors, typically incorrect reading of sentences or a noisy environment.

(2) Artur-J, public speech, 62 hours in total. It includes:
  (2a) Artur-J-Splosni, 62 hours: Manual transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc. Transcriptions were made in two modes:
    - 'pog' files include the pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
    - 'std' files include standardised or expanded orthographic transcriptions (the standard Slovenian spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis)

(3) Artur-N, private speech, 74 hours in total. It includes:
  (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for domain-specific speech recognition of face descriptions.
  (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to freely express instructions for a potential smart-home system. Designed for domain-specific speech recognition in a smart-home setting.
  (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or speak freely about casual topics. The manual transcriptions were done in two modes, the same as for Artur-J.

(4) Artur-P, parliamentary speech, 201 hours in total. It includes:
  (4a) Artur-P-SejeDZ, 201 hours: Transcriptions of speech from the Slovene National Assembly. Manual transcriptions were done in two modes, the same as for Artur-J.

Further information on the database, including various statistics, is available in the Artur-DOC directory, which is part of Artur_1.0_TRS.
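Since the transcriptions are distributed in the TRS format of Transcriber, which is XML-based, they can be read with a standard XML parser. The following is a minimal Python sketch, assuming the usual Transcriber layout in which `Turn` elements carry a `speaker` attribute and `Sync` elements mark segment start times. The sample document, the speaker id `spk1`, and the function name `read_segments` are hypothetical illustrations and are not taken from the corpus; actual Artur files may contain additional elements and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the style of a Transcriber .trs file.
# It is NOT taken from the Artur corpus.
SAMPLE_TRS = """<Trans scribe="example" audio_filename="example" version="1">
  <Speakers>
    <Speaker id="spk1" name="speaker one"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="4.2">
      <Turn speaker="spk1" startTime="0" endTime="4.2">
        <Sync time="0"/>
        first segment
        <Sync time="2.1"/>
        second segment
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_segments(trs_text):
    """Return a list of (speaker, start_time, text) tuples from TRS text."""
    root = ET.fromstring(trs_text)
    segments = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker")
        start = None
        for child in turn:
            # A Sync element opens a new segment at the given time.
            if child.tag == "Sync":
                start = float(child.get("time"))
            # The transcribed text follows the element as its "tail".
            # Text after other child elements is kept as a separate
            # entry with the start time of the most recent Sync.
            text = (child.tail or "").strip()
            if text and start is not None:
                segments.append((speaker, start, text))
    return segments


for speaker, start, text in read_segments(SAMPLE_TRS):
    print(speaker, start, text)
```

For a file on disk, the same loop could be applied to `ET.parse(path).getroot()` instead of `ET.fromstring(...)`.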
dc.language.iso slv
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.publisher Faculty of Electrical Engineering, University of Ljubljana
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.publisher ZRC SAZU
dc.relation.replaces http://hdl.handle.net/11356/1718
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://slovenscina.eu/
dc.subject speech database
dc.subject spoken language
dc.subject spoken corpus
dc.subject automatic speech recognition
dc.title ASR database ARTUR 1.0 (transcriptions)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Darinka Verdonik darinka.verdonik@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor
contact.person Simon Dobrišek simon.dobrisek@fe.uni-lj.si Faculty of Electrical Engineering, University of Ljubljana
contact.person Marko Bajec marko.bajec@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person Mitja Trojar mitja.trojar@zrc-sazu.si ZRC SAZU
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 884 hours
files.count 1
files.size 50992514


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: Artur_1.0_TRS.tgz
Size: 48.63 MB
Format: Unknown
Description: Compressed TAR archive of the TRS corpus and documentation
MD5: 6f21947593ccdea7dc23ecc3c9a7c012
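
The MD5 value listed for the archive can be used to check a downloaded copy for corruption. The following is a minimal Python sketch, assuming the archive has been saved locally under its listed name; the function names `md5_of_file` and `verify_download` are hypothetical helpers, and the expected values are copied from this record (MD5 and `files.size`).

```python
import hashlib
import os

# Values copied from this record.
EXPECTED_MD5 = "6f21947593ccdea7dc23ecc3c9a7c012"
EXPECTED_SIZE = 50992514  # bytes, from files.size


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if both the file size and the MD5 digest match."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

A call such as `verify_download("Artur_1.0_TRS.tgz")` would then return `True` only for an intact copy. MD5 is suitable here as an integrity check against transfer errors, not as a security guarantee.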