ASR database ARTUR 1.0 (transcriptions)

Name: ASR database ARTUR 1.0 (transcriptions)
License: https://creativecommons.org/licenses/by-sa/4.0/

Verdonik, Darinka; Bizjak, Andreja; Sepesy Maučec, Mirjam; Gril, Lucija; Dobrišek, Simon; Križaj, Janez; Strle, Gregor; Bajec, Marko; Lebar Bajec, Iztok; Jelovšek, Tjaša; Lokovšek, Jure; Trojar, Mitja; Erjavec, Tomaž; Bernjak, Mitja; Žganec Gros, Jerneja; Čakš, Peter; Pucer, Matevž; Cvetko, Mitja; Pavlič, Jani; Zelenik, Marijana; Ivanovska, Marija; Grm, Klemen; Longyka, Jure; Mihelič, Aleš; Vesnicer, Boštjan; Dretnik, Naum

dc.contributor.author	Verdonik, Darinka
dc.contributor.author	Bizjak, Andreja
dc.contributor.author	Sepesy Maučec, Mirjam
dc.contributor.author	Gril, Lucija
dc.contributor.author	Dobrišek, Simon
dc.contributor.author	Križaj, Janez
dc.contributor.author	Strle, Gregor
dc.contributor.author	Bajec, Marko
dc.contributor.author	Lebar Bajec, Iztok
dc.contributor.author	Jelovšek, Tjaša
dc.contributor.author	Lokovšek, Jure
dc.contributor.author	Trojar, Mitja
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Bernjak, Mitja
dc.contributor.author	Žganec Gros, Jerneja
dc.contributor.author	Čakš, Peter
dc.contributor.author	Pucer, Matevž
dc.contributor.author	Cvetko, Mitja
dc.contributor.author	Pavlič, Jani
dc.contributor.author	Zelenik, Marijana
dc.contributor.author	Ivanovska, Marija
dc.contributor.author	Grm, Klemen
dc.contributor.author	Longyka, Jure
dc.contributor.author	Mihelič, Aleš
dc.contributor.author	Vesnicer, Boštjan
dc.contributor.author	Dretnik, Naum
dc.date.accessioned	2023-03-05T08:41:37Z
dc.date.available	2023-03-05T08:41:37Z
dc.date.issued	2023-02-22
dc.identifier.uri	http://hdl.handle.net/11356/1772
dc.description	Artur 1.0 is a speech database designed for developing automatic speech recognition for the Slovenian language. The complete database includes 1,067 hours of speech, of which 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes transcriptions only; the audio files are available at http://hdl.handle.net/11356/1776. Transcriptions are provided in the original TRS format of the Transcriber 1.5.1 tool, which was used to make them. All transcriptions were either made manually or corrected manually. The data are structured as follows:
(1) Artur-B, read speech, 573 hours in total. It includes:
(1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen in such a way that they reflect the natural or the actual distribution of triphones in the words. They were distributed among 1,000 speakers, so that approximately 30 minutes of read speech were recorded from each speaker. The speakers were balanced by gender, age, and region; a small proportion were non-native speakers of Slovene. Each sentence is its own transcription file and has a corresponding audio file.
(1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations, personal names, and surnames, chosen so that all Slovene letters were covered, plus the most common foreign letters. The transcriptions were corrected manually.
(1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own transcription file and has a corresponding recording.
(1d) Artur-B-Izloceno, 27 hours: In TRS format only. The recordings that correspond to these transcriptions include different types of errors, typically incorrect reading of sentences or a noisy environment.
(2) Artur-J, public speech, 62 hours in total. It includes:
(2a) Artur-J-Splosni, 62 hours: Manual transcriptions of media recordings, online recordings of conferences, workshops, educational videos, etc. Transcriptions were made in two modes:
- 'pog' files include the pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
- 'std' files include standardised or expanded orthographic transcriptions (the standard Slovenian spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis)
(3) Artur-N, private speech, 74 hours in total. It includes:
(3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for domain-specific speech recognition of face descriptions.
(3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to freely phrase instructions for a potential smart-home system. Designed for domain-specific speech recognition in smart-home settings.
(3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or explain freely on casual topics. The manual transcriptions were done in two modes, the same as for Artur-J.
(4) Artur-P, parliamentary speech, 201 hours in total. It includes:
(4a) Artur-P-SejeDZ, 201 hours: Transcriptions of speech from the Slovene National Assembly. Manual transcriptions were done in two modes, the same as for Artur-J.
Further information on the database, including various statistics, is available in the Artur-DOC directory, which is part of Artur_1.0_TRS.
dc.language.iso	slv
dc.publisher	Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.publisher	Faculty of Electrical Engineering, University of Ljubljana
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.publisher	ZRC SAZU
dc.relation.replaces	http://hdl.handle.net/11356/1718
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://slovenscina.eu/
dc.subject	speech database
dc.subject	spoken language
dc.subject	spoken corpus
dc.subject	automatic speech recognition
dc.title	ASR database ARTUR 1.0 (transcriptions)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Darinka Verdonik darinka.verdonik@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor
contact.person	Simon Dobrišek simon.dobrisek@fe.uni-lj.si Faculty of Electrical Engineering, University of Ljubljana
contact.person	Marko Bajec marko.bajec@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person	Mitja Trojar mitja.trojar@zrc-sazu.si ZRC SAZU
contact.person	Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor	Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info	884 hours
files.count	1
files.size	50992514
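
The description states that the transcriptions are distributed in the TRS format of the Transcriber tool, which is XML-based. The following is a minimal sketch of how such a file might be read with the Python standard library. It assumes the commonly seen Transcriber layout (Turn elements carrying speaker and startTime attributes, with Sync elements marking segment starts); the sample content, the speaker id, and the function names read_segments and _append are hypothetical and not taken from the ARTUR files, whose exact structure should be checked against the Artur-DOC directory.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general shape of a Transcriber TRS file
# (Trans > Episode > Section > Turn > Sync). Real ARTUR files may differ.
SAMPLE_TRS = """<?xml version="1.0" encoding="UTF-8"?>
<Trans version="1" audio_filename="example">
  <Speakers>
    <Speaker id="spk1" name="speaker-1"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="4.2">
      <Turn speaker="spk1" startTime="0" endTime="4.2">
        <Sync time="0"/>
        dober dan
        <Sync time="2.1"/>
        kako ste
      </Turn>
    </Section>
  </Episode>
</Trans>
"""


def _append(segments, speaker, start, parts):
    """Store a segment if it contains any non-whitespace text."""
    text = " ".join("".join(parts).split())
    if text:
        segments.append((speaker, start, text))


def read_segments(trs_text):
    """Return (speaker, start_time, text) tuples, one per Sync-delimited segment."""
    root = ET.fromstring(trs_text.encode("utf-8"))
    segments = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        start = float(turn.get("startTime", "0"))
        parts = [turn.text or ""]
        for child in turn:
            if child.tag == "Sync":
                # A Sync closes the previous segment and opens a new one.
                _append(segments, speaker, start, parts)
                start = float(child.get("time", "0"))
                parts = []
            # Text following any child element is stored in that child's tail.
            parts.append(child.tail or "")
        _append(segments, speaker, start, parts)
    return segments


for speaker, start, text in read_segments(SAMPLE_TRS):
    print(speaker, start, text)
```

For a file on disk, the same function could be fed the file's decoded contents. Elements other than Sync inside a Turn are skipped here and only their trailing text is kept, which is a simplification made for this sketch.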