dc.contributor.author Čibej, Jaka
dc.date.accessioned 2025-05-13T07:41:34Z
dc.date.available 2025-05-13T07:41:34Z
dc.date.issued 2024-12-20
dc.identifier.uri http://hdl.handle.net/11356/2025
dc.description ILS is a dataset of Slovene word forms that contain a single lC bigram, i.e. an "l" grapheme followed by a consonant grapheme ("l" + C(onsonant) = lC bigram). This combination is one of the less predictable pronunciation ambiguities in Slovene, as the "l" grapheme is sometimes pronounced as /l/ (e.g. "alge") and sometimes as /u̯/ (e.g. "polža"). In some cases, both variants are acceptable (e.g. "morilka"), but there is disagreement within the linguistic community on which pronunciations are acceptable in standard Slovene. The word forms containing an lC bigram were extracted from the manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). Approximately 6,600 lexemes were exported along with their inflected forms. The inflected forms were then annotated by 5 linguists in PyBossa (https://docs.pybossa.com/). Each set of forms within a lexeme was annotated by two linguists for the standard Slovene pronunciation of the lC bigram (L, U, or both). The dataset enables further linguistic analyses of the pronunciation of "l" in pre-consonant position in Slovene words and can serve as a starting point for identifying the most problematic points of disagreement in pronunciation, which can be addressed in future studies. Version 1.0 includes 173,419 annotated word forms with 2 annotations each. Forms containing multiple lC bigrams were excluded from this version, as they account for only approximately 5% of all lC bigram forms; they will be included in future versions. For a more detailed description of the file structure, please see 00README.txt.
dc.language.iso slv
dc.publisher Faculty of Arts, University of Ljubljana
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject spoken Slovene
dc.subject pronunciation ambiguity
dc.subject pre-consonant l grapheme
dc.title Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 173419 words
files.count 1
files.size 1103882
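
Since the description states that every word form carries two annotations (L, U, or both), a natural first analysis is raw inter-annotator agreement. The following is a minimal sketch only: the column names `form`, `annotation_1`, `annotation_2` and the label string `LU` for "both" are assumptions, not taken from the dataset (the real header is documented in 00README.txt), and the three toy rows are invented for illustration rather than copied from ILS.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; check 00README.txt for the real header.
FORM, ANN_1, ANN_2 = "form", "annotation_1", "annotation_2"


def agreement_summary(tsv_handle):
    """Return (raw agreement rate, Counter of disagreeing label pairs)."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    total = 0
    agreed = 0
    disagreements = Counter()
    for row in reader:
        total += 1
        first, second = row[ANN_1], row[ANN_2]
        if first == second:
            agreed += 1
        else:
            # Sort the pair so (L, LU) and (LU, L) are counted together.
            disagreements[tuple(sorted((first, second)))] += 1
    rate = agreed / total if total else 0.0
    return rate, disagreements


# Toy data, invented for illustration; "LU" is an assumed label for "both".
TOY_TSV = (
    "form\tannotation_1\tannotation_2\n"
    "alge\tL\tL\n"
    "polža\tU\tU\n"
    "morilka\tL\tLU\n"
)

rate, disagreements = agreement_summary(io.StringIO(TOY_TSV))
print(f"raw agreement: {rate:.2f}")
print(disagreements.most_common())
```

To run this on the actual file, the same function could be given `open("ILS_1.0.tsv", encoding="utf-8", newline="")` once the column constants have been adjusted to the documented header.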


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name ILS_1.0.zip
Size 1.05 MB
Format application/zip
Description ILS 1.0 (Dataset in TSV format)
MD5 aaadf5db5cddaf82e6204181fe479049
File preview
    • ILS_1.0.tsv (18 MB)
    • 00README.txt (4 kB)
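
The MD5 value listed for ILS_1.0.zip can be used to confirm that a download is intact. A minimal sketch using only the Python standard library follows; the local path `ILS_1.0.zip` in the current directory is an assumption about where the archive was saved, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record for ILS_1.0.zip.
EXPECTED_MD5 = "aaadf5db5cddaf82e6204181fe479049"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("ILS_1.0.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.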