dc.contributor.author | Dobrovoljc, Kaja |
dc.date.accessioned | 2025-05-27T14:56:27Z |
dc.date.available | 2025-05-27T14:56:27Z |
dc.date.issued | 2025-05-26 |
dc.identifier.uri | http://hdl.handle.net/11356/2036 |
dc.description | This dataset contains lists of delexicalized dependency trees and subtrees extracted from the English UD GUM corpus, version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic inventory of syntactic structures in English, supporting data-driven investigations into syntactic patterns and their variation across modalities. The GUM corpus was divided into spoken and written subsets based on the original genre classifications. The spoken subset includes interviews, conversations, podcasts, vlogs, courtroom transcripts, and speeches, while the written subset includes news articles, academic texts, fiction, how-to guides, biographies, essays, letters, textbooks, and travel guides. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). For each of the two subcorpora (spoken and written), structures were extracted in three versions: (1) the full version; (2) a version excluding punctuation (i.e., branches labeled as punct); (3) a version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse). The extracted structures are provided in tabular TSV format. Each row contains: the delexicalized tree/subtree (e.g., ADJ <amod NOUN); its absolute and relative frequency in the target corpus (e.g., GUM-spoken); an example (e.g., nice <amod example); its frequency in the corresponding reference corpus (e.g., GUM-written); and keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF). |
dc.language.iso | eng |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://spot.ff.uni-lj.si/ |
dc.subject | dependency trees |
dc.subject | dependency treebank |
dc.subject | keyword extraction |
dc.subject | corpus linguistics |
dc.subject | written corpus |
dc.subject | spoken corpus |
dc.title | Syntactic Tree Inventories from English GUM UD Corpus (v2.15) |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
files.count | 6 |
files.size | 44445314 |
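
The description above lists LL, Odds Ratio, and %DIFF as keyness measures for comparing the target and reference corpora. As a rough illustration of what such measures express, the sketch below uses their common textbook definitions. This is an assumption: the record does not state the exact formulas or smoothing used by STARK, so recomputed values may differ from those in the TSV files. The function and parameter names are hypothetical.

```python
import math

# Illustrative keyness measures under common textbook definitions.
# Assumption: these are NOT confirmed to match the formulas STARK uses.
# a  = frequency of a structure in the target corpus (e.g., GUM-spoken)
# b  = frequency of the same structure in the reference corpus (e.g., GUM-written)
# n1 = total number of structures in the target corpus
# n2 = total number of structures in the reference corpus


def log_likelihood(a, b, n1, n2):
    """Log-likelihood (G2) over the two observed frequencies."""
    expected_target = n1 * (a + b) / (n1 + n2)
    expected_reference = n2 * (a + b) / (n1 + n2)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_target)
    if b > 0:
        total += b * math.log(b / expected_reference)
    return 2 * total


def odds_ratio(a, b, n1, n2):
    """Odds of the structure in the target corpus divided by its odds in the reference."""
    if b == 0 or a == n1:
        return math.inf  # undefined without smoothing; reported here as infinity
    return (a / (n1 - a)) / (b / (n2 - b))


def percent_diff(a, b, n1, n2):
    """%DIFF: relative-frequency difference as a percentage of the reference value."""
    relative_target = a / n1
    relative_reference = b / n2
    if relative_reference == 0:
        return math.inf  # undefined without smoothing; reported here as infinity
    return (relative_target - relative_reference) / relative_reference * 100
```

With made-up counts, `percent_diff(20, 10, 100, 100)` says the structure is twice as frequent in the target corpus (a difference of 100%), while `log_likelihood` indicates how strongly the data support that the difference is not due to chance.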
Files in this item

- Name
- GUM-written-2.15_all.tsv
- Size
- 9.84 MB
- Format
- Unknown
- Description
- Written trees (all)
- MD5
- 77ab28c0af43f63ea312e72735ee0ac5

- Name
- GUM-written-2.15_no-punct.tsv
- Size
- 8.69 MB
- Format
- Unknown
- Description
- Written trees excluding punctuation
- MD5
- 03d655850c94f991e7d3991d03192348

- Name
- GUM-written-2.15_no-disfl.tsv
- Size
- 8.67 MB
- Format
- Unknown
- Description
- Written trees excluding disfluencies
- MD5
- 80cb17edeed0769c4745b01bd76aad7f

- Name
- GUM-spoken-2.15_all.tsv
- Size
- 5.55 MB
- Format
- Unknown
- Description
- Spoken trees (all)
- MD5
- 0ec02778a43358d1a5ceb6b18a1165f8

- Name
- GUM-spoken-2.15_no-punct.tsv
- Size
- 4.93 MB
- Format
- Unknown
- Description
- Spoken trees excluding punctuation
- MD5
- 67d1258578bd10122ca765d1e3c285a8

- Name
- GUM-spoken-2.15_no-disfl.tsv
- Size
- 4.7 MB
- Format
- Unknown
- Description
- Spoken trees excluding disfluencies
- MD5
- c1f4e5f8e92eb333ecc6151adbfc8de9