Slovene morphological segmentation and word formation dataset KOBOS

Name: Slovene morphological segmentation and word formation dataset KOBOS
License: https://creativecommons.org/licenses/by/4.0/

Pranjić, Marko; Kern, Boris; Voršič, Ines; Pollak, Senja

dc.contributor.author	Pranjić, Marko
dc.contributor.author	Kern, Boris
dc.contributor.author	Voršič, Ines
dc.contributor.author	Pollak, Senja
dc.date.accessioned	2026-03-09T15:54:39Z
dc.date.available	2026-03-09T15:54:39Z
dc.date.issued	2026-03-13
dc.identifier.uri	http://hdl.handle.net/11356/2060
dc.description	This dataset provides word-level multidimensional morphological annotations for Slovene, containing 1,935 entries manually annotated by two domain experts. The target words were sampled from Sloleks 3.0 to provide data for morphological analysis, computational modeling, and linguistic research.
The dataset is formatted as a lexicon (.tsv) containing five columns:
1. word: the target word
2. part_of_speech: the part-of-speech tag (noun, verb, adjective, adverb, or particle)
3. morphological_segments: all surface-level morphemes
4. word_formation_segments: derivational morphemes only
5. simplex: the base word(s)
The dataset captures three distinct dimensions of morphological analysis, defined as follows.
Morphological segments (the 'morphological_segments' column) identify all surface-level morphemes in a word, including both derivational and inflectional affixes. This segmentation describes how a word is modified to fit its grammatical role (such as encoding case, gender, and number).
Word formation segments (the 'word_formation_segments' column) focus exclusively on the derivational processes used to create new words. Because inflectional morphology is a separate process that only modifies existing words, inflectional endings are excluded from word formation segments. For example, the adjective "nepozidan" ('not built-up') has the morphological segmentation "ne-po-zid-a-n-0" (capturing the inflectional state), whereas its word formation segmentation is "ne-po-zida-n", reflecting its specific derivational chain (zidati -> pozidati -> pozidan -> nepozidan).
Zero-morphemes are integrated directly into both segmentation columns and are represented by the character "0". A zero-morpheme is a morpheme without a phonetic form that marks a grammatical distinction not explicitly realized in speech. It can function both as an inflectional morpheme (e.g., marking nominative masculine nouns that lack an explicit suffix) and as a word formation morpheme needed to derive a specific part of speech from a base word.
Simplex (the 'simplex' column) gives the corresponding absolute base word(s), that is, words that have not been formed through any word formation process. A simplex cannot be further divided into two or more word formation morphemes. For example, the participle "leteč" ('flying') has the simplex "leteti" ('to fly') rather than the noun "let" ('flight'). For compound words, the simplex column contains multiple base words separated by a comma (e.g., the adjective "trikolesen" ('three-wheeled') has the simplexes "tri, kolo").
The annotations achieved high inter-annotator agreement (Krippendorff's alpha of 86.80% for morphological segmentation and 85.16% for word formation segmentation). This is the first publicly available Slovene dataset combining morphological segmentation, word formation segmentation, zero-morphemes, and simplex annotations in a single resource.
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.subject	morphology
dc.subject	derivational morphology
dc.subject	word formation
dc.subject	manual annotation
dc.title	Slovene morphological segmentation and word formation dataset KOBOS
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	lexicon
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Marko Pranjić marko.pranjic@ijs.si Jožef Stefan Institute
sponsor	Slovenian Research and Innovation Agency (ARIS) P2-0103 Core research program Knowledge Technologies nationalFunds
sponsor	Slovenian Research and Innovation Agency (ARIS) J6-3131 Formant combinatorics in Slovenian nationalFunds
sponsor	Slovenian Research and Innovation Agency (ARIS) GC-0002 LLM4DH: Large Language Models for Digital Humanities nationalFunds
size.info	1935 entries
files.count	1
files.size	97920
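
The column layout given in dc.description can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that segments are separated by "-" and simplexes by "," as in the examples quoted in the description, and that the file may or may not begin with a header row naming the five columns (the record does not say). The names Entry, split_segments, split_simplex, surface_form, read_entries and load_lexicon are hypothetical helpers introduced here for illustration.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, List

# Column names as listed in dc.description.
COLUMNS = ["word", "part_of_speech", "morphological_segments",
           "word_formation_segments", "simplex"]
ZERO = "0"  # character used for zero-morphemes, per the description


@dataclass
class Entry:
    word: str
    part_of_speech: str
    morphological_segments: List[str]
    word_formation_segments: List[str]
    simplex: List[str]


def split_segments(value: str) -> List[str]:
    """Split a segmentation such as 'ne-po-zid-a-n-0' on '-' (assumed separator)."""
    return value.strip().split("-")


def split_simplex(value: str) -> List[str]:
    """Split a simplex cell such as 'tri, kolo' on commas and trim spaces."""
    return [part.strip() for part in value.split(",") if part.strip()]


def has_zero_morpheme(segments: List[str]) -> bool:
    """True if the segmentation contains a zero-morpheme marker."""
    return ZERO in segments


def surface_form(segments: List[str]) -> str:
    """Concatenate the segments, skipping zero-morphemes."""
    return "".join(seg for seg in segments if seg != ZERO)


def read_entries(lines: Iterable[str]) -> List[Entry]:
    """Parse tab-separated lines; a header row equal to COLUMNS is skipped if present."""
    entries = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row == COLUMNS:
            continue
        word, pos, morph, wf, simplex = row
        entries.append(Entry(word, pos, split_segments(morph),
                             split_segments(wf), split_simplex(simplex)))
    return entries


def load_lexicon(path: str) -> List[Entry]:
    """Read a local copy of the lexicon file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_entries(handle)


# The two examples quoted in the description:
print(split_segments("ne-po-zid-a-n-0"))  # ['ne', 'po', 'zid', 'a', 'n', '0']
print(split_simplex("tri, kolo"))         # ['tri', 'kolo']
```

For the "nepozidan" example, surface_form returns the word itself for both segmentations. Whether concatenation reproduces the word for every entry is not stated in the record, so it is best treated as a check to run rather than a guarantee.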

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: kobos-lexicon.tsv
Size: 95.62 KB
Format: Unknown
Description: Annotated dataset (TSV)
MD5: 7c0e89045260fd94dfd54ce5f428e5e3
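
A downloaded copy can be checked against the MD5 and size recorded for this file. The sketch below uses only the Python standard library. It assumes that files.size (97920) is a byte count, which is consistent with the listed 95.62 KB. The function names md5_of_file and matches_record are hypothetical and introduced here for illustration.

```python
import hashlib
import os

# Values copied from this record.
EXPECTED_MD5 = "7c0e89045260fd94dfd54ce5f428e5e3"
EXPECTED_SIZE = 97920  # assumed to be bytes


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """True if the file's size and MD5 both match the expected values."""
    return (os.path.getsize(path) == expected_size
            and md5_of_file(path) == expected_md5.lower())
```

Calling matches_record("kobos-lexicon.tsv") on a local copy should return True if the download is intact. MD5 is suitable here only as an integrity check against accidental corruption, not as protection against tampering.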
