dc.contributor.author | Čibej, Jaka |
dc.date.accessioned | 2025-05-13T07:41:34Z |
dc.date.available | 2025-05-13T07:41:34Z |
dc.date.issued | 2024-12-20 |
dc.identifier.uri | http://hdl.handle.net/11356/2025 |
dc.description | ILS is a dataset of Slovene word forms that contain a single lC bigram, i.e. an "l" grapheme preceding a consonant grapheme (a bigram of "l"+C(onsonant) = lC bigram). This combination is one of the less predictable pronunciation ambiguities in Slovene, as the "l" grapheme is sometimes pronounced as /l/ (e.g. "alge") and sometimes as /u̯/ (e.g. "polža"). In some cases, both variants are acceptable (e.g. "morilka"), but there is disagreement within the linguistic community on which pronunciations are acceptable in standard Slovene. The word forms containing an lC bigram were extracted from the manually validated lexemes of Sloleks 3.0 (http://hdl.handle.net/11356/1745). Approximately 6,600 lexemes were exported along with their inflected forms. The inflected forms were then annotated by 5 linguists in PyBossa (https://docs.pybossa.com/). Each set of forms within a lexeme was annotated by two linguists in terms of the standard Slovene pronunciation of the lC bigram (L, U, or both). The dataset enables additional linguistic analyses of the pronunciation of L in pre-consonant position in Slovene words and can be used as a starting point for identifying the most problematic points of disagreement in pronunciation, which can be included in future studies. Version 1.0 includes 173,419 annotated word forms with 2 annotations each. Forms containing multiple lC bigrams were excluded from this version, as they account for only approximately 5 % of all lC bigram forms; these will be included in future versions. For a more detailed description of the file structure, please see 00README.txt. |
dc.language.iso | slv |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | spoken Slovene |
dc.subject | pronunciation ambiguity |
dc.subject | pre-consonant l grapheme |
dc.title | Dataset of Annotated Slovene Words with Pre-Consonant L ILS 1.0 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Jaka Čibej jaka.cibej@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 173419 words |
files.count | 1 |
files.size | 1103882 |
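The lC bigram defined in the description can be illustrated with a minimal sketch. This is not the extraction procedure used for ILS. The consonant inventory below is an assumption (the consonant letters of the Slovene alphabet), the function names are hypothetical, and the sketch treats "lj" as an lC bigram, which the record does not confirm.

```python
import re

# Assumption: the consonant graphemes of the Slovene alphabet.
# The record does not state which inventory was used for ILS.
CONSONANTS = "bcčdfghjklmnprsštvzž"

# An "l" immediately followed by a consonant grapheme. The lookahead keeps
# the consonant unconsumed, so adjacent bigrams are each counted.
LC_BIGRAM = re.compile(f"l(?=[{CONSONANTS}])")


def count_lc_bigrams(form: str) -> int:
    """Return the number of lC bigrams in a word form (case-insensitive)."""
    return len(LC_BIGRAM.findall(form.lower()))


def has_single_lc_bigram(form: str) -> bool:
    """True if the form contains exactly one lC bigram."""
    return count_lc_bigrams(form) == 1


if __name__ == "__main__":
    # The three examples come from the description above.
    for form in ("alge", "polža", "morilka"):
        print(form, count_lc_bigrams(form))
```

Under these assumptions, the three example forms from the description each contain exactly one lC bigram, so they would pass the single-bigram filter.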
Files in this item
Name | ILS_1.0.zip |
Size | 1.05 MB |
Format | application/zip |
Description | ILS 1.0 (Dataset in TSV format) |
MD5 | aaadf5db5cddaf82e6204181fe479049 |
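The MD5 value listed for the file can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the local path `ILS_1.0.zip` and the function names are assumptions, not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in this record for ILS_1.0.zip.
EXPECTED_MD5 = "aaadf5db5cddaf82e6204181fe479049"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("ILS_1.0.zip")  # assumed location of the downloaded archive
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch would suggest an incomplete or altered download. MD5 is suitable here only as an integrity check, not as protection against deliberate tampering.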