dc.contributor.author Verdonik, Darinka
dc.contributor.author Bizjak, Andreja
dc.contributor.author Sepesy Maučec, Mirjam
dc.contributor.author Gril, Lucija
dc.contributor.author Dobrišek, Simon
dc.contributor.author Križaj, Janez
dc.contributor.author Strle, Gregor
dc.contributor.author Bajec, Marko
dc.contributor.author Lebar Bajec, Iztok
dc.contributor.author Jelovšek, Tjaša
dc.contributor.author Lokovšek, Jure
dc.contributor.author Trojar, Mitja
dc.date.accessioned 2022-12-06T15:28:54Z
dc.date.available 2022-12-06T15:28:54Z
dc.date.issued 2022-12-06
dc.identifier.uri http://hdl.handle.net/11356/1718
dc.description ARTUR is a speech database designed for automatic speech recognition of the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed and the remaining 195 hours are not. The data is divided into 4 parts: (1) approx. 520 hours of read speech, consisting of readings of pre-defined sentences selected from the Gigafida corpus; each sentence is stored in one file; the speakers are demographically balanced; spelling is included in separate files; all recordings have manual transcriptions; (2) approx. 204 hours of public speech, which includes media recordings, online recordings of conferences, workshops, educational videos, etc.; 56 hours are manually transcribed; (3) approx. 110 hours of private speech, which includes monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (i.e., smart-home and face-description) are included; 63 hours are manually transcribed; (4) approx. 201 hours of parliamentary speech, which includes recordings from the Slovene National Assembly, all with manual transcriptions. This repository entry includes only the transcriptions, in Transcriber 1.5.1 TRS format; the audio recordings are available at http://hdl.handle.net/11356/1717.
dc.language.iso slv
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.publisher Faculty of Electrical Engineering, University of Ljubljana
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.publisher ZRC SAZU
dc.relation.isreplacedby http://hdl.handle.net/11356/1772
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://slovenscina.eu/
dc.subject speech database
dc.subject spoken language
dc.subject spoken corpus
dc.subject automatic speech recognition
dc.title ASR database ARTUR 0.1 (transcriptions)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Darinka Verdonik darinka.verdonik@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor
contact.person Simon Dobrišek simon.dobrisek@fe.uni-lj.si Faculty of Electrical Engineering, University of Ljubljana
contact.person Marko Bajec marko.bajec@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
contact.person Mitja Trojar mitja.trojar@zrc-sazu.si ZRC SAZU
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 722 hours
files.count 2
files.size 46743263
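
The description above states that the transcriptions are in Transcriber 1.5.1 TRS format, which is XML-based. The following is a minimal sketch of reading time-aligned segments from such a file with the Python standard library. The sample document, the speaker name, and the helper `read_segments` are hypothetical and only illustrate the general Transcriber layout (`Turn` elements containing `Sync` time marks followed by text); the actual ARTUR files may contain additional elements, such as events or comments, that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the general Transcriber layout; not taken from the corpus.
SAMPLE_TRS = """<Trans audio_filename="example" version="1">
  <Speakers>
    <Speaker id="spk1" name="G0001"/>
  </Speakers>
  <Episode>
    <Section type="report" startTime="0" endTime="4.5">
      <Turn speaker="spk1" startTime="0" endTime="4.5">
        <Sync time="0"/>
        dober dan
        <Sync time="2.1"/>
        kako ste
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def read_segments(trs_text):
    """Return (start_time, speaker_name, text) tuples, one per non-empty Sync segment."""
    root = ET.fromstring(trs_text)
    # Map speaker ids to names; fall back to the raw id if no name is declared.
    speakers = {s.get("id"): s.get("name") for s in root.iter("Speaker")}
    segments = []
    for turn in root.iter("Turn"):
        speaker_id = turn.get("speaker")
        speaker = speakers.get(speaker_id, speaker_id)
        for sync in turn.iter("Sync"):
            # The text of a segment is the "tail" that follows the empty Sync element.
            text = " ".join((sync.tail or "").split())
            if text:
                segments.append((float(sync.get("time")), speaker, text))
    return segments


if __name__ == "__main__":
    for start, speaker, text in read_segments(SAMPLE_TRS):
        print(f"{start:6.2f}  {speaker}  {text}")
```

For files on disk, `ET.parse(path).getroot()` would be the usual replacement for `ET.fromstring`, since it honours the encoding declared in the file.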


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name: Artur_0.1_Transcriptions.tgz
Size: 44.58 MB
Format: Unknown
Description: Unix TAR archive of all the corpus files compressed with Gnu Zip
MD5: 42b28577a7ece758a27abf5bbd2a1e67

Name: 00Preberime.txt
Size: 2.17 KB
Format: Text file
Description: Corpus file structure description (in Slovenian)
MD5: 9a0b7646a1b9b552e99678cc15cfe254

File preview:
00Preberime.txt			- this file
Artur-DOC			- folder with documentation on the Artur database
	Artur-B				- folder with documentation on read speech
	Artur-J				- folder with documentation on public speech
	Artur-N				- folder with documentation on non-public speech
	Artur-P				- folder with documentation on parliamentary speech
Artur-TRS			- folder with transcriptions in TRS format
	Artur-B				- folder with read-speech files
		00Artur-B-Govorci.tsv		- file with data on the read-speech speakers
		00Artur-B-Posnetki.tsv		- file with data on the read-speech recordings
		Artur-B-Brani			- folder with read-speech transcriptions
			Artur-B-G0001 etc.		- folders with transcriptions for each individual speaker
		Artur-B-Crkovani		- folder with spelling transcriptions
			Artur-B-G0501 etc.		- folders with transcriptions for each individual speaker
		Artur-B-Izloceno		- folder with transcriptions that deviate from the specified criteria (poor WAV quality, reading errors...)
			Artur-B-G0001 etc.		- folders with transcriptions for individual . . .
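
The MD5 values listed for the two files in this item can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_downloads` are hypothetical, and the sketch assumes the files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 values as listed in this record.
EXPECTED_MD5 = {
    "Artur_0.1_Transcriptions.tgz": "42b28577a7ece758a27abf5bbd2a1e67",
    "00Preberime.txt": "9a0b7646a1b9b552e99678cc15cfe254",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True/False, or None if the file is not present."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, outcome in verify_downloads().items():
        print(f"{name}: {labels[outcome]}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.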
                                            
