dc.contributor.author | Dobrovoljc, Kaja |
dc.date.accessioned | 2025-05-24T14:32:13Z |
dc.date.available | 2025-05-24T14:32:13Z |
dc.date.issued | 2025-05-23 |
dc.identifier.uri | http://hdl.handle.net/11356/2035 |
dc.description | This dataset contains lists of delexicalized dependency trees and subtrees extracted from the Slovenian UD corpora SSJ (written) and SST (spoken), version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic set of syntactic structures in Slovenian, useful for data-based investigations of syntactic patterns in Slovenian and their variation across the two modalities. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). Structures were extracted from three versions of each corpus: (1) the full version; (2) a version excluding punctuation (i.e., branches labeled as punct); (3) a version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse). The extracted structures are provided in tab-separated (TSV) format. Each row contains: the delexicalized tree/subtree (e.g., ADJ <amod NOUN); its absolute and relative frequency in the target corpus (e.g., spoken SST); an example (e.g., samostojna <amod država); its frequency in the corresponding reference corpus (e.g., written SSJ); and keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF). The STARK configuration file used in the extraction process is included. |
dc.language.iso | slv |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://spot.ff.uni-lj.si/ |
dc.subject | dependency trees |
dc.subject | dependency treebank |
dc.subject | keyword extraction |
dc.subject | corpus linguistics |
dc.subject | written corpus |
dc.subject | spoken corpus |
dc.title | Syntactic Tree Inventories from Slovenian UD Corpora (v2.15) |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
files.count | 6 |
files.size | 77720157 |
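
The keyness measures named in the description (LL, Odds Ratio, %DIFF) can be illustrated with a minimal Python sketch. It uses the commonly cited textbook definitions of two-corpus log-likelihood, odds ratio, and %DIFF; whether STARK computes them in exactly this way (for example, how it handles zero frequencies) is an assumption, and the function and argument names are hypothetical.

```python
import math


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood (LL).

    a, b: frequency of a structure in the target / reference corpus
    c, d: total size of the target / reference corpus
    Assumption: textbook definition, not confirmed as STARK's implementation.
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, expected_target), (b, expected_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


def odds_ratio(a, b, c, d):
    """Odds of the structure in the target corpus divided by its odds in the reference corpus."""
    numerator = a * (d - b)
    denominator = b * (c - a)
    if denominator == 0:
        return math.inf  # undefined ratio, reported here as infinity (an assumption)
    return numerator / denominator


def pct_diff(a, b, c, d):
    """%DIFF: percentage difference between the two relative frequencies."""
    relative_target = a / c
    relative_reference = b / d
    if relative_reference == 0:
        return math.inf  # structure absent from the reference corpus
    return (relative_target - relative_reference) / relative_reference * 100


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not taken from the dataset.
    counts = (20, 10, 100, 100)
    print(log_likelihood(*counts), odds_ratio(*counts), pct_diff(*counts))
```

With these definitions, a structure that is equally frequent in both corpora (relative to corpus size) gets LL = 0, an odds ratio of 1, and %DIFF = 0; values further from these indicate a stronger association with one modality.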
Files in this item



- Name
- SSJ-2.15_no-punct.tsv
- Size
- 17.32 MB
- Format
- Unknown
- Description
- Written trees excluding punctuation
- MD5
- 1513725f06137adf45fcf098ba194a71

- Name
- SSJ-2.15_all.tsv
- Size
- 19.63 MB
- Format
- Unknown
- Description
- Written trees (all)
- MD5
- 1593a6b16fa7c56e2bfe954587cfdb87

- Name
- SSJ-2.15_no-disfl.tsv
- Size
- 17.3 MB
- Format
- Unknown
- Description
- Written trees excluding disfluencies
- MD5
- 2ed6437caf198c4cb0295dc0016c2680

- Name
- SST-2.15_all.tsv
- Size
- 7.69 MB
- Format
- Unknown
- Description
- Spoken trees (all)
- MD5
- 8711a8935755d6c56d7ce8bcc14ce440

- Name
- SST-2.15_no-punct.tsv
- Size
- 6.45 MB
- Format
- Unknown
- Description
- Spoken trees excluding punctuation
- MD5
- 4eb1507d78f3624b8e1078d71c3442d7

- Name
- SST-2.15_no-disfl.tsv
- Size
- 5.74 MB
- Format
- Unknown
- Description
- Spoken trees excluding disfluencies
- MD5
- 657f5d9539f057eae644bf8847ec3048
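
The MD5 values listed above can be used to check downloaded files for corruption. The following is a minimal Python sketch using only the standard library; it assumes the six TSV files sit in a single local directory under their listed names, and the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# File names and checksums copied from the file list above.
EXPECTED_MD5 = {
    "SSJ-2.15_no-punct.tsv": "1513725f06137adf45fcf098ba194a71",
    "SSJ-2.15_all.tsv": "1593a6b16fa7c56e2bfe954587cfdb87",
    "SSJ-2.15_no-disfl.tsv": "2ed6437caf198c4cb0295dc0016c2680",
    "SST-2.15_all.tsv": "8711a8935755d6c56d7ce8bcc14ce440",
    "SST-2.15_no-punct.tsv": "4eb1507d78f3624b8e1078d71c3442d7",
    "SST-2.15_no-disfl.tsv": "657f5d9539f057eae644bf8847ec3048",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each file name to True (match), False (mismatch), or None (file not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify().items():
        print(f"{name}: {labels[status]}")
```

A `MISMATCH` result usually means an incomplete or altered download; fetching the file again is the simplest remedy.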