| dc.contributor.author | Pranjić, Marko |
| dc.contributor.author | Kern, Boris |
| dc.contributor.author | Voršič, Ines |
| dc.contributor.author | Pollak, Senja |
| dc.date.accessioned | 2026-03-09T15:54:39Z |
| dc.date.available | 2026-03-09T15:54:39Z |
| dc.date.issued | 2026-03-13 |
| dc.identifier.uri | http://hdl.handle.net/11356/2060 |
| dc.description | This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words in the dataset were sampled from Sloleks 3.0 to provide data for morphological analysis, computational modeling, and linguistic research. The dataset is formatted as a lexicon (.tsv) containing five columns: (1) word: the target word; (2) part_of_speech: the part-of-speech tag (noun, verb, adjective, adverb, or particle); (3) morphological_segments: all surface-level morphemes; (4) word_formation_segments: derivational morphemes only; (5) simplex: the base word(s). The dataset captures three distinct dimensions of morphological analysis, which are defined as follows: Morphological segments (the 'morphological_segments' column) identify all surface-level morphemes in a word, including both derivational and inflectional affixes. This segmentation describes how a word is modified to fit its grammatical role (such as encoding case, gender, and number). Word formation segments (the 'word_formation_segments' column) focus exclusively on the derivational processes used to create new words. Because inflectional morphology is a separate process that only modifies existing words, inflectional endings are excluded from word formation segments. For example, the adjective "nepozidan" ('not built-up') has the morphological segmentation "ne-po-zid-a-n-0" (capturing the inflectional state), whereas its word formation segmentation is "ne-po-zida-n", reflecting its specific derivational chain (zidati -> pozidati -> pozidan -> nepozidan). Zero-morphemes are integrated directly into both segmentation columns (represented by the character "0"). A zero-morpheme represents a morpheme without a phonetic form that is used to mark grammatical distinctions not explicitly realized in speech. It can function as both an inflectional morpheme (e.g., marking nominative masculine nouns that lack an explicit suffix) and a word formation morpheme necessary for deriving a specific part of speech from a base word. Simplex (the 'simplex' column) represents the corresponding absolute base word(s) that have not been formed through any word formation process. A simplex cannot be further divided into two or more word formation morphemes. For example, the participle "leteč" ('flying') has the simplex "leteti" ('to fly') rather than the noun "let" ('flight'). In cases of compound words, the simplex column contains multiple base words separated by a comma (e.g., the adjective "trikolesen" ('three-wheeled') has the simplexes "tri, kolo"). The annotations achieved high inter-annotator agreement (86.80% Krippendorff's Alpha for morphological segmentation, and 85.16% for word formation segments). This is the first publicly available Slovene dataset combining morphological segmentation, word formation segmentation, zero-morphemes, and simplex annotations in a single resource. |
| dc.language.iso | slv |
| dc.publisher | Jožef Stefan Institute |
| dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
| dc.rights.label | PUB |
| dc.subject | morphology |
| dc.subject | derivational morphology |
| dc.subject | word formation |
| dc.subject | manual annotation |
| dc.title | Slovene morphological segmentation and word formation dataset KOBOS |
| dc.type | lexicalConceptualResource |
| metashare.ResourceInfo#ContentInfo.detailedType | lexicon |
| metashare.ResourceInfo#ContentInfo.mediaType | text |
| has.files | yes |
| branding | CLARIN.SI data & tools |
| contact.person | Marko Pranjić marko.pranjic@ijs.si Jožef Stefan Institute |
| sponsor | Slovenian Research and Innovation Agency (ARIS) P2-0103 Core research program Knowledge Technologies nationalFunds |
| sponsor | Slovenian Research and Innovation Agency (ARIS) J6-3131 Formant combinatorics in Slovenian nationalFunds |
| sponsor | ARIS (Slovenian Research and Innovation Agency) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds |
| size.info | 1935 entries |
| files.count | 1 |
| files.size | 97920 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: kobos-lexicon.tsv
- Size: 95.62 KB
- Format: Unknown
- Description: Annotated dataset (TSV)
- MD5: 7c0e89045260fd94dfd54ce5f428e5e3
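
The record describes a five-column TSV lexicon with hyphen-separated segmentations, "0" for zero-morphemes, and comma-separated simplexes. The following Python sketch shows one way such a file might be verified and read. It rests on assumptions the record does not confirm: that the first row of `kobos-lexicon.tsv` is a header carrying the five column names, and that "-" is the only segment separator. The function names are hypothetical helpers, not part of any published tooling for the dataset.

```python
import csv
import hashlib

# MD5 checksum as listed for kobos-lexicon.tsv in this record.
EXPECTED_MD5 = "7c0e89045260fd94dfd54ce5f428e5e3"


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_segments(segmentation):
    """Split a segmentation such as "ne-po-zid-a-n-0" into morphemes.

    Assumption: "-" is the only separator, as in the record's examples.
    """
    return segmentation.split("-")


def without_zero_morphemes(segments):
    """Drop zero-morphemes, which the record says are written as "0"."""
    return [segment for segment in segments if segment != "0"]


def split_simplex(simplex):
    """Split a simplex cell such as "tri, kolo" into its base words."""
    return [part.strip() for part in simplex.split(",")]


def load_lexicon(path):
    """Read the lexicon into a list of dicts keyed by column name.

    Assumption: the first row is a header with the names word,
    part_of_speech, morphological_segments, word_formation_segments
    and simplex. If the file has no header, pass `fieldnames=` to
    csv.DictReader instead.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

A typical use would be to compare `md5_of_file("kobos-lexicon.tsv")` with `EXPECTED_MD5` before calling `load_lexicon`, then apply `split_segments` to the two segmentation columns and `split_simplex` to the simplex column. If the header assumption holds, the loaded list should match the 1935 entries stated in the record.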