Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0

Name: Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Čibej, Jaka

dc.contributor.author	Čibej, Jaka
dc.date.accessioned	2025-05-13T07:41:34Z
dc.date.available	2025-05-13T07:41:34Z
dc.date.issued	2024-12-20
dc.identifier.uri	http://hdl.handle.net/11356/2025
dc.description	ILS is a dataset of Slovene word forms that contain a single lC bigram, i.e. an "l" grapheme preceding a consonant grapheme ("l" + C(onsonant) = lC bigram). This combination is one of the less predictable pronunciation ambiguities in Slovene, as the "l" grapheme is sometimes pronounced as /l/ (e.g. "alge") and sometimes as /u̯/ (e.g. "polža"). In some cases, both variants are acceptable (e.g. "morilka"), but there is disagreement within the linguistic community on which pronunciations are acceptable in standard Slovene. The word forms containing an lC bigram were extracted from the manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). Approximately 6,600 lexemes were exported along with their inflected forms. The inflected forms were then annotated by 5 linguists in PyBossa (https://docs.pybossa.com/). Each set of forms within a lexeme was annotated by two linguists in terms of the standard Slovene pronunciation of the lC bigram (L, U, or both). The dataset enables additional linguistic analyses of the pronunciation of L in pre-consonant position in Slovene words and can be used as a starting point for identifying the most problematic points of disagreement in pronunciation, which can be included in future studies. Version 1.0 includes 173,419 annotated word forms with 2 annotations each. Forms containing multiple lC bigrams were excluded from this version, as they account for only approximately 5 % of all lC bigram forms; these will be included in future versions. For a more detailed description of the file structure, please see 00README.txt.
dc.language.iso	slv
dc.publisher	Faculty of Arts, University of Ljubljana
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.subject	spoken Slovene
dc.subject	pronunciation ambiguity
dc.subject	pre-consonant l grapheme
dc.title	Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Jaka Čibej jaka.cibej@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	173419 words
files.count	1
files.size	1103882
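
The single-lC-bigram criterion from the description can be illustrated with a minimal Python sketch. This is not the extraction code used for ILS. The function names are hypothetical, and the consonant test is an assumption: any alphabetic character other than a, e, i, o, u counts as a consonant grapheme, so "lj" and "ll" are also counted, which may differ from the rules actually applied to Sloleks 3.0.

```python
# Assumption: the five Slovene vowel graphemes; every other letter is
# treated as a consonant grapheme (including "j" and "l" itself).
VOWELS = set("aeiou")


def count_lc_bigrams(form: str) -> int:
    """Count occurrences of "l" immediately followed by a consonant grapheme."""
    s = form.lower()
    return sum(
        1
        for first, second in zip(s, s[1:])
        if first == "l" and second.isalpha() and second not in VOWELS
    )


def has_single_lc_bigram(form: str) -> bool:
    """True if the form contains exactly one lC bigram (the ILS 1.0 criterion)."""
    return count_lc_bigrams(form) == 1


if __name__ == "__main__":
    # "alge", "polža" and "morilka" come from the description;
    # "mleko" and "polnilnik" are illustrative words chosen for this sketch.
    for word in ["alge", "polža", "morilka", "mleko", "polnilnik"]:
        print(word, count_lc_bigrams(word), has_single_lc_bigram(word))
```

Under these assumptions, the three example words from the description each contain one lC bigram, while a form such as "polnilnik" contains two and would fall into the group excluded from version 1.0.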