Syntactic Tree Inventories from Slovenian UD Corpora (v2.15)

License: https://creativecommons.org/licenses/by/4.0/

Dobrovoljc, Kaja

dc.contributor.author	Dobrovoljc, Kaja
dc.date.accessioned	2025-05-24T14:32:13Z
dc.date.available	2025-05-24T14:32:13Z
dc.date.issued	2025-05-23
dc.identifier.uri	http://hdl.handle.net/11356/2035
dc.description	This dataset contains lists of delexicalized dependency trees and subtrees extracted from the Slovenian UD corpora SSJ (written) and SST (spoken), version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic set of syntactic structures in Slovenian, useful for data-based investigations of syntactic patterns in Slovenian and their variation across the two modalities. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). Structures were extracted from three versions of each corpus: (1) the full version; (2) a version excluding punctuation (i.e., branches labeled as punct); and (3) a version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse). The extracted structures are provided in tabular TSV format. Each row contains: the delexicalized tree/subtree (e.g., ADJ <amod NOUN); its absolute and relative frequency in the target corpus (e.g., spoken SST); an example (e.g., samostojna <amod država); its frequency in the corresponding reference corpus (e.g., written SSJ); and keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF). The STARK configuration file used in the extraction process is included.
dc.language.iso	slv
dc.publisher	Faculty of Arts, University of Ljubljana
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.1515/cllt-2025-0046
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://spot.ff.uni-lj.si/
dc.subject	dependency trees
dc.subject	dependency treebank
dc.subject	keyword extraction
dc.subject	corpus linguistics
dc.subject	written corpus
dc.subject	spoken corpus
dc.title	Syntactic Tree Inventories from Slovenian UD Corpora (v2.15)
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count	6
files.size	77720157
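
The keyness measures named in the description (LL, Odds Ratio, %DIFF) can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of log-likelihood, odds ratio, and %DIFF; whether STARK computes them in exactly this way (for example, how it handles zero frequencies) is an assumption. The function names, the argument layout, and the counts in the usage lines are hypothetical and are not taken from the dataset.

```python
import math


def log_likelihood(a, n1, b, n2):
    """Log-likelihood (G2) keyness.

    a, b   : absolute frequency of a tree in the target / reference corpus
    n1, n2 : total number of trees in the target / reference corpus
    """
    total = a + b
    expected_target = n1 * total / (n1 + n2)
    expected_reference = n2 * total / (n1 + n2)
    ll = 0.0
    for observed, expected in ((a, expected_target), (b, expected_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def odds_ratio(a, n1, b, n2):
    """Odds of the tree in the target corpus divided by its odds in the reference."""
    numerator = a * (n2 - b)
    denominator = b * (n1 - a)
    # Assumption: an undefined or unbounded ratio is reported as infinity.
    return math.inf if denominator == 0 else numerator / denominator


def percent_diff(a, n1, b, n2):
    """%DIFF: relative difference of the normalized frequencies, in percent."""
    rel_target = a / n1
    rel_reference = b / n2
    if rel_reference == 0:
        return math.inf  # assumption: no smoothing constant is substituted
    return (rel_target - rel_reference) * 100 / rel_reference


# Hypothetical counts: a tree seen 20 times in a 1,000-tree target corpus
# and 10 times in a 1,000-tree reference corpus.
print(log_likelihood(20, 1000, 10, 1000))
print(odds_ratio(20, 1000, 10, 1000))
print(percent_diff(20, 1000, 10, 1000))
```

The functions take raw counts rather than reading the TSV files directly, because the exact column headers of the files are not stated in this record.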