dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2025-05-24T14:32:13Z
dc.date.available 2025-05-24T14:32:13Z
dc.date.issued 2025-05-23
dc.identifier.uri http://hdl.handle.net/11356/2035
dc.description This dataset contains lists of delexicalized dependency trees and subtrees extracted from the Slovenian UD corpora SSJ (written) and SST (spoken), version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic set of syntactic structures in Slovenian, useful for data-based investigations of syntactic patterns in Slovenian and their variation across the two modalities. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN).

Structures were extracted from three versions of each corpus:
(1) The full version
(2) A version excluding punctuation (i.e., branches labeled as punct)
(3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)

The extracted structures are provided in tabular TSV format. Each row contains:
* The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
* Its absolute and relative frequency in the target corpus (e.g., spoken SST)
* An example (e.g., samostojna <amod država)
* Frequency in the corresponding reference corpus (e.g., written SSJ)
* Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)

The STARK configuration file used in the extraction process is included.
dc.language.iso slv
dc.publisher Faculty of Arts, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://spot.ff.uni-lj.si/
dc.subject dependency trees
dc.subject dependency treebank
dc.subject keyword extraction
dc.subject corpus linguistics
dc.subject written corpus
dc.subject spoken corpus
dc.title Syntactic Tree Inventories from Slovenian UD Corpora (v2.15)
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count 6
files.size 77720157
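
The keyness measures named in the description (LL, Odds Ratio, %DIFF) can be illustrated with a minimal Python sketch. It uses the textbook two-corpus definitions of log-likelihood, odds ratio, and %DIFF, which may differ in detail (e.g., smoothing of zero frequencies) from what STARK actually computes. The function name `keyness` and its parameters are hypothetical, and "corpus size" is assumed here to mean the total number of extracted structures in each corpus.

```python
import math


def keyness(freq_target, size_target, freq_ref, size_ref):
    """Return (log-likelihood, odds ratio, %DIFF) for one structure.

    Hypothetical helper: freq_* are absolute frequencies of the structure,
    size_* are the total counts of structures in each corpus (an assumption).
    Assumes 0 <= freq < size for both corpora.
    """
    # Expected frequencies if the structure were equally common in both corpora.
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    exp_target = size_target * total_freq / total_size
    exp_ref = size_ref * total_freq / total_size

    # Two-cell log-likelihood; a zero observed count contributes nothing.
    ll = 0.0
    for observed, expected in ((freq_target, exp_target), (freq_ref, exp_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    ll *= 2

    # Odds ratio: odds of the structure in the target vs. the reference corpus.
    odds_target = freq_target / (size_target - freq_target)
    odds_ref = freq_ref / (size_ref - freq_ref)
    odds_ratio = odds_target / odds_ref if odds_ref > 0 else math.inf

    # %DIFF: percentage difference between the relative frequencies.
    rel_target = freq_target / size_target
    rel_ref = freq_ref / size_ref
    pct_diff = (rel_target - rel_ref) / rel_ref * 100 if rel_ref > 0 else math.inf

    return ll, odds_ratio, pct_diff
```

When a structure has the same relative frequency in both corpora, this sketch yields LL = 0, an odds ratio of 1, and a %DIFF of 0; values further from these indicate a stronger association with one modality.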


Files in this item (6 files, 74.12 MB in total)

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).

Name SSJ-2.15_no-punct.tsv
Size 17.32 MB
Format Unknown
Description Written trees excluding punctuation
MD5 1513725f06137adf45fcf098ba194a71

Name SSJ-2.15_all.tsv
Size 19.63 MB
Format Unknown
Description Written trees (all)
MD5 1593a6b16fa7c56e2bfe954587cfdb87

Name SSJ-2.15_no-disfl.tsv
Size 17.3 MB
Format Unknown
Description Written trees excluding disfluencies
MD5 2ed6437caf198c4cb0295dc0016c2680

Name SST-2.15_all.tsv
Size 7.69 MB
Format Unknown
Description Spoken trees (all)
MD5 8711a8935755d6c56d7ce8bcc14ce440

Name SST-2.15_no-punct.tsv
Size 6.45 MB
Format Unknown
Description Spoken trees excluding punctuation
MD5 4eb1507d78f3624b8e1078d71c3442d7

Name SST-2.15_no-disfl.tsv
Size 5.74 MB
Format Unknown
Description Spoken trees excluding disfluencies
MD5 657f5d9539f057eae644bf8847ec3048
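
The MD5 checksums listed for each file can be used to confirm that a download is intact. The following Python sketch shows one way to do this with the standard library; the checksums are copied from this record, while the helper names `md5_of` and `verify` and the assumption that the files sit in a single local directory are illustrative only.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "SSJ-2.15_no-punct.tsv": "1513725f06137adf45fcf098ba194a71",
    "SSJ-2.15_all.tsv": "1593a6b16fa7c56e2bfe954587cfdb87",
    "SSJ-2.15_no-disfl.tsv": "2ed6437caf198c4cb0295dc0016c2680",
    "SST-2.15_all.tsv": "8711a8935755d6c56d7ce8bcc14ce440",
    "SST-2.15_no-punct.tsv": "4eb1507d78f3624b8e1078d71c3442d7",
    "SST-2.15_no-disfl.tsv": "657f5d9539f057eae644bf8847ec3048",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file name to True if present with a matching checksum."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify("path/to/downloads")` returns a dictionary keyed by file name; a `False` value means the file is missing from that directory or its checksum does not match the one listed here.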
