| dc.contributor.author | Čibej, Jaka |
| dc.contributor.author | Gantar, Kaja |
| dc.contributor.author | Gantar, Polona |
| dc.contributor.author | Šešet, Jure |
| dc.contributor.author | Krek, Simon |
| dc.contributor.author | Robida, Nejc |
| dc.date.accessioned | 2026-02-14T08:22:10Z |
| dc.date.available | 2026-02-14T08:22:10Z |
| dc.date.issued | 2026-02-03 |
| dc.identifier.uri | http://hdl.handle.net/11356/2084 |
| dc.description | MEZZANINE-NstdLex is a dataset of 4,237 potentially non-standard vocabulary candidates collected from the Sloleks Morphological Lexicon of Slovene (from among the manually inspected entries of version 3.0; http://hdl.handle.net/11356/1745), from various corpora of spoken Slovene (e.g. GOS 1.1 http://hdl.handle.net/11356/1438; GOS-VL 4.2 http://hdl.handle.net/11356/1444; Artur 1.0 http://hdl.handle.net/11356/1772), and from transcriptions (not publicly available) of Slovene university lectures used for the Online Notes project (https://www.cjvt.si/online-notes/). Most of the candidates were collected through manual analysis. The exception is the 1,232 candidate pairs in the file "MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv", which were extracted automatically by comparing pairs of entries using Levenshtein distance to find potential non-standard pairs that differ in the presence of the letter "j" (e.g. "genialec" vs. "genijalec"). The candidates were manually annotated for non-standardness according to a custom typology: STD (standard vocabulary), NST (vocabulary that is non-standard in terms of register), OBL (vocabulary that is non-standard in terms of form), and NSTg (vocabulary that is non-standard and typically spoken); combinations of these tags are also possible for borderline examples. The main purpose of the dataset is to provide an overview of the different types of non-standard and spoken words in Slovene. This overview can serve as a basis for the robust, empirical development of lexicographic labels for Slovene language resources. For more information on the structure of the dataset, please consult 00README.txt. |
| dc.language.iso | slv |
| dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
| dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
| dc.rights.label | PUB |
| dc.source.uri | http://mezzanine.um.si/ |
| dc.subject | spoken Slovene |
| dc.subject | non-standard language |
| dc.subject | non-standardness |
| dc.subject | lexicon |
| dc.title | List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0 |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | lexicon |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Jaka Čibej jaka.cibej@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
| sponsor | ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds |
| sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
| sponsor | University of Ljubljana P6-0215 Slovene Language - Basic, Contrastive, and Applied Studies nationalFunds |
| size.info | 3 files |
| size.info | 4237 words |
| files.count | 1 |
| files.size | 84111 |
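
The description mentions that the pairs in "MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv" were found by comparing entries with Levenshtein distance and keeping pairs that differ in the presence of the letter "j". The record does not give the exact procedure, so the following is only a minimal sketch of one way such a filter could work. The function names (`levenshtein`, `differs_only_by_j`, `j_pairs`), the distance threshold of 1, and the brute-force pairwise comparison are all assumptions made for illustration, not the authors' implementation.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def differs_only_by_j(shorter: str, longer: str) -> bool:
    """True if removing a single "j" from `longer` yields `shorter`."""
    if len(longer) != len(shorter) + 1:
        return False
    return any(
        ch == "j" and longer[:i] + longer[i + 1:] == shorter
        for i, ch in enumerate(longer)
    )


def j_pairs(lemmas):
    """Return (without_j, with_j) pairs at edit distance 1 (assumed threshold)."""
    pairs = []
    # Hypothetical brute-force comparison; fine for a small sample,
    # quadratic in the number of lemmas.
    for a, b in combinations(sorted(set(lemmas)), 2):
        if levenshtein(a, b) != 1:
            continue
        shorter, longer = sorted((a, b), key=len)
        if differs_only_by_j(shorter, longer):
            pairs.append((shorter, longer))
    return pairs


# "genialec" / "genijalec" is the example from the description;
# the other two words are filler for the demonstration.
sample = ["genialec", "genijalec", "miza", "riba"]
print(j_pairs(sample))
```

On a full lexicon, the quadratic comparison would be slow, so a practical variant might index each entry by its form with one "j" removed and look up matches directly; the distance check is kept here only to mirror the wording of the description.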
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: MEZZANINE-NstdLex_1.0.zip
- Size: 82.14 KB
- Format: application/zip
- Description: TSV files
- MD5: 8b81aa0bca84f18fb6315a1cf8422b0a
- MEZZANINE-NstdLex_1.0
  - MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv (62 kB)
  - 00README.txt (6 kB)
  - MEZZANINE-NstdLex__speech-corpora.tsv (15 kB)
  - MEZZANINE-NstdLex__Sloleks3.0_manual.tsv (111 kB)
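
The MD5 value listed for the archive can be used to check a downloaded copy. The sketch below uses only Python's standard `hashlib` module; the helper names (`md5_of_file`, `verify_download`) and the assumption that the archive sits in the current directory under its listed name are illustrative, not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed in this record for MEZZANINE-NstdLex_1.0.zip.
EXPECTED_MD5 = "8b81aa0bca84f18fb6315a1cf8422b0a"


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Assumed local path; adjust to wherever the archive was saved.
archive = Path("MEZZANINE-NstdLex_1.0.zip")
if archive.exists():
    print("checksum OK" if verify_download(archive) else "checksum mismatch")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.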