dc.contributor.author Pranjić, Marko
dc.contributor.author Kern, Boris
dc.contributor.author Voršič, Ines
dc.contributor.author Pollak, Senja
dc.date.accessioned 2026-03-09T15:54:39Z
dc.date.available 2026-03-09T15:54:39Z
dc.date.issued 2026-03-13
dc.identifier.uri http://hdl.handle.net/11356/2060
dc.description This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in the dataset were sampled from Sloleks 3.0 to provide data for morphological analysis, computational modeling, and linguistic research.

The dataset is formatted as a lexicon (.tsv) containing five columns:
1. word: the target word
2. part_of_speech: the part-of-speech tag (noun, verb, adjective, adverb, or particle)
3. morphological_segments: all surface-level morphemes
4. word_formation_segments: derivational morphemes only
5. simplex: the base word(s)

The dataset captures three distinct dimensions of morphological analysis, which are defined as follows:

Morphological segments (the 'morphological_segments' column) identify all surface-level morphemes in a word, including both derivational and inflectional affixes. This segmentation describes how a word is modified to fit its grammatical role (such as encoding case, gender, and number).

Word formation segments (the 'word_formation_segments' column) focus exclusively on the derivational processes used to create new words. Because inflectional morphology is a separate process that only modifies existing words, inflectional endings are excluded from word formation segments. For example, the adjective "nepozidan" ('not built-up') has the morphological segmentation "ne-po-zid-a-n-0" (capturing the inflectional state), whereas its word formation segmentation is "ne-po-zida-n", reflecting its specific derivational chain (zidati -> pozidati -> pozidan -> nepozidan).

Zero-morphemes are integrated directly into both segmentation columns (represented by the character "0"). A zero-morpheme represents a morpheme without a phonetic form that is used to mark grammatical distinctions not explicitly realized in speech. It can function as both an inflectional morpheme (e.g., marking nominative masculine nouns that lack an explicit suffix) and a word formation morpheme necessary for deriving a specific part of speech from a base word.

Simplex (the 'simplex' column) represents the corresponding absolute base word(s) that have not been formed through any word formation process. A simplex cannot be further divided into two or more word formation morphemes. For example, the participle "leteč" ('flying') has the simplex "leteti" ('to fly') rather than the noun "let" ('flight'). In cases of compound words, the simplex column contains multiple base words separated by a comma (e.g., the adjective "trikolesen" ('three-wheeled') has the simplexes "tri, kolo").

The annotations achieved high inter-annotator agreement (86.80% Krippendorff's Alpha for morphological segmentation, and 85.16% for word formation segments). This is the first publicly available Slovene dataset combining morphological segmentation, word formation segmentation, zero-morphemes, and simplex annotations in a single resource.
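The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the function names are hypothetical, the `has_header` flag is an assumption (the record does not say whether the file starts with a header row), and the sample row is illustrative. Its segmentations for "nepozidan" come from the description, but its simplex value "zidati" and the spelling of the tag as "adjective" are assumed for the demonstration.

```python
import csv
import io

# Column names as listed in the dataset description.
COLUMNS = [
    "word",
    "part_of_speech",
    "morphological_segments",
    "word_formation_segments",
    "simplex",
]

ZERO = "0"  # the description states that zero-morphemes are written as "0"


def split_segments(field):
    """Split a hyphen-separated segmentation into a list of morphemes."""
    return field.split("-")


def surface_form(segments):
    """Join segments, skipping zero-morphemes, which have no phonetic form."""
    return "".join(seg for seg in segments if seg != ZERO)


def split_simplex(field):
    """Split the simplex column; compounds list several comma-separated bases."""
    return [part.strip() for part in field.split(",")]


def read_lexicon(text, has_header=False):
    """Parse TSV text into a list of dicts keyed by COLUMNS.

    has_header is an assumption: set it to True if the file turns out
    to begin with a header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    if has_header:
        rows = rows[1:]
    return [dict(zip(COLUMNS, row)) for row in rows]


# Illustrative row: the simplex value "zidati" is assumed, not taken from the file.
sample = "nepozidan\tadjective\tne-po-zid-a-n-0\tne-po-zida-n\tzidati\n"
entry = read_lexicon(sample)[0]
morphs = split_segments(entry["morphological_segments"])
print(morphs)                       # ['ne', 'po', 'zid', 'a', 'n', '0']
print(surface_form(morphs))         # nepozidan
print(split_simplex("tri, kolo"))   # ['tri', 'kolo']
```

For this example, dropping the "0" and joining the morphological segments gives back the target word. Whether that holds for every entry is not stated in the record, so treat it as a sanity check to try, not a guaranteed invariant.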
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject morphology
dc.subject derivational morphology
dc.subject word formation
dc.subject manual annotation
dc.title Slovene morphological segmentation and word formation dataset KOBOS
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType lexicon
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Marko Pranjić marko.pranjic@ijs.si Jožef Stefan Institute
sponsor Slovenian Research and Innovation Agency (ARIS) P2-0103 Core research program Knowledge Technologies nationalFunds
sponsor Slovenian Research and Innovation Agency (ARIS) J6-3131 Formant combinatorics in Slovenian nationalFunds
sponsor ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
size.info 1935 entries
files.count 1
files.size 97920


Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: kobos-lexicon.tsv
Size: 95.62 KB
Format: Unknown
Description: Annotated dataset (TSV)
MD5: 7c0e89045260fd94dfd54ce5f428e5e3
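
The MD5 value and the byte count in files.size can be used to check a downloaded copy. The sketch below is one way to do that with Python's standard library. The function names are hypothetical, and the local path passed to `verify_download` is an assumption about where the file was saved.

```python
import hashlib
import os

# Values copied from this record (MD5 field and files.size).
EXPECTED_MD5 = "7c0e89045260fd94dfd54ce5f428e5e3"
EXPECTED_SIZE = 97920  # bytes


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file at path matches the expected size and MD5."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

A call such as `verify_download("kobos-lexicon.tsv")` returns True only when both the size and the digest match. MD5 is adequate for detecting a corrupted or truncated download, but it is not a defence against deliberate tampering.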