dc.contributor.author Čibej, Jaka
dc.contributor.author Gantar, Kaja
dc.contributor.author Gantar, Polona
dc.contributor.author Šešet, Jure
dc.contributor.author Krek, Simon
dc.contributor.author Robida, Nejc
dc.date.accessioned 2026-02-14T08:22:10Z
dc.date.available 2026-02-14T08:22:10Z
dc.date.issued 2026-02-03
dc.identifier.uri http://hdl.handle.net/11356/2084
dc.description MEZZANINE-NstdLex is a dataset containing 4,237 potentially non-standard vocabulary candidates from the Sloleks Morphological Lexicon of Slovene (collected from among the manually inspected entries of version 3.0; http://hdl.handle.net/11356/1745) and various corpora of spoken Slovene (e.g. GOS 1.1 http://hdl.handle.net/11356/1438; GOS-VL 4.2 http://hdl.handle.net/11356/1444; Artur 1.0 http://hdl.handle.net/11356/1772) and transcriptions (not publicly available) of Slovene university lectures used for the Online Notes project (https://www.cjvt.si/online-notes/). Most of the candidates were collected through manual analysis, with the exception of 1,232 candidate pairs in file "MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv", which were extracted automatically by comparing pairs of entries using Levenshtein distance to extract potential non-standard pairs that differ in the presence of the letter "j" (e.g. "genialec" vs. "genijalec"). The candidates were manually annotated in terms of their non-standardness according to a custom typology (STD - standard vocabulary, NST - vocabulary that is non-standard in terms of register, OBL - vocabulary that is non-standard in terms of form, NSTg - vocabulary that is non-standard and typically spoken; combinations of these tags are also possible for borderline examples). The main purpose of the dataset is to provide an overview of different types of non-standard and spoken words in Slovene. The overview can be used as a basis for a robust, empirical development of lexicographic labels for Slovene language resources. For more information on the structure of the dataset, please consult 00README.txt.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://mezzanine.um.si/
dc.subject spoken Slovene
dc.subject non-standard language
dc.subject non-standardness
dc.subject lexicon
dc.title List of potentially non-standard vocabulary candidates MEZZANINE-NstdLex 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType lexicon
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J7-4642 MEZZANINE nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor University of Ljubljana P6-0215 Slovene Language - Basic, Contrastive, and Applied Studies nationalFunds
size.info 3 files
size.info 4237 words
files.count 1
files.size 84111
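
The dc.description field says that the 1,232 candidate pairs in "MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv" were extracted automatically by comparing entries with Levenshtein distance and keeping pairs that differ in the presence of the letter "j". The record does not give the actual extraction script, so the following is only a minimal sketch of that idea. It assumes a plain in-memory list of lemmas and a distance threshold of exactly 1; the function names (levenshtein, differs_by_single_j, j_pairs) are hypothetical and are not taken from the dataset or its README.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def differs_by_single_j(shorter: str, longer: str) -> bool:
    """True if removing one 'j' from `longer` yields `shorter`."""
    if len(longer) != len(shorter) + 1:
        return False
    return any(longer[i] == "j" and longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def j_pairs(lemmas):
    """Return (without_j, with_j) pairs at distance 1 (assumed threshold)."""
    pairs = []
    # Naive all-pairs comparison; fine for a sketch, slow for a full lexicon.
    for a, b in combinations(sorted(set(lemmas)), 2):
        if levenshtein(a, b) != 1:
            continue
        shorter, longer = sorted((a, b), key=len)
        if differs_by_single_j(shorter, longer):
            pairs.append((shorter, longer))
    return pairs


if __name__ == "__main__":
    print(j_pairs(["genialec", "genijalec", "miza", "mija"]))
```

With the example pair from the description, the sketch keeps ("genialec", "genijalec") and discards "miza" vs. "mija", which are also at distance 1 but differ by a substitution rather than by an added "j". Candidates found this way would still need the manual annotation the description mentions.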


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name MEZZANINE-NstdLex_1.0.zip
Size 82.14 KB
Format application/zip
Description TSV files
MD5 8b81aa0bca84f18fb6315a1cf8422b0a
 File Preview  
  • MEZZANINE-NstdLex_1.0
    • MEZZANINE-NstdLex__Sloleks3.0_levenshtein.tsv (62 kB)
    • 00README.txt (6 kB)
    • MEZZANINE-NstdLex__speech-corpora.tsv (15 kB)
    • MEZZANINE-NstdLex__Sloleks3.0_manual.tsv (111 kB)
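
The record lists an MD5 checksum for MEZZANINE-NstdLex_1.0.zip, which can be used to check that a download is intact. A minimal sketch using Python's standard hashlib module follows; it assumes the archive has been saved under its original name in the current directory, and the helper names (md5_of_file, verify_download) are hypothetical.

```python
import hashlib
import sys

# Checksum as listed in this record for MEZZANINE-NstdLex_1.0.zip.
EXPECTED_MD5 = "8b81aa0bca84f18fb6315a1cf8422b0a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local file name; pass another path as the first argument.
    archive = sys.argv[1] if len(sys.argv) > 1 else "MEZZANINE-NstdLex_1.0.zip"
    print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
```

MD5 is used here only because it is the checksum the record publishes; it detects accidental corruption but is not a safeguard against deliberate tampering.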
