Syntactic Tree Inventories from English GUM UD Corpus (v2.15)

License: https://creativecommons.org/licenses/by/4.0/

Dobrovoljc, Kaja

dc.contributor.author	Dobrovoljc, Kaja
dc.date.accessioned	2025-05-27T14:56:27Z
dc.date.available	2025-05-27T14:56:27Z
dc.date.issued	2025-05-26
dc.identifier.uri	http://hdl.handle.net/11356/2036
dc.description	This dataset contains lists of delexicalized dependency trees and subtrees extracted from the English UD GUM corpus, version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic inventory of syntactic structures in English, supporting data-driven investigations into syntactic patterns and their variation across modalities.

The GUM corpus was divided into spoken and written subsets based on the original genre classifications. The spoken subset includes interviews, conversations, podcasts, vlogs, courtroom transcripts, and speeches, while the written subset includes news articles, academic texts, fiction, how-to guides, biographies, essays, letters, textbooks, and travel guides.

Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). For each of the two subcorpora (spoken and written), structures were extracted in three versions:
(1) The full version
(2) A version excluding punctuation (i.e., branches labeled as punct)
(3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)

The extracted structures are provided in tabular TSV format. Each row contains:
* The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
* Its absolute and relative frequency in the target corpus (e.g., GUM-spoken)
* An example (e.g., nice <amod example)
* Frequency in the corresponding reference corpus (e.g., GUM-written)
* Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)
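As an illustration of the keyness measures named above, the following is a minimal sketch that computes log-likelihood (LL), odds ratio, and %DIFF from a structure's frequency in a target and a reference corpus. It uses the textbook definitions of these measures; the exact formulas, smoothing, and zero handling used to produce the dataset are not stated in this record, so the function name `keyness`, its parameters, and the counts in the usage lines are all hypothetical.

```python
import math


def keyness(freq_target, size_target, freq_ref, size_ref):
    """Return (LL, odds ratio, %DIFF) for one structure.

    Assumption: sizes are the total number of structures (or tokens)
    in each subcorpus; the dataset may define corpus size differently.
    """
    a, b = freq_target, freq_ref
    c, d = size_target, size_ref

    # Expected frequencies if the structure were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    # Log-likelihood: zero observed counts contribute nothing to the sum.
    ll = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)

    # Odds ratio: odds of the structure in the target vs. the reference corpus.
    # Undefined cases are mapped to infinity here (an illustrative choice).
    if b > 0 and c > a:
        odds_ratio = (a / (c - a)) / (b / (d - b))
    else:
        odds_ratio = math.inf

    # %DIFF: relative difference between the normalized frequencies.
    nf_target, nf_ref = a / c, b / d
    pct_diff = (nf_target - nf_ref) / nf_ref * 100 if nf_ref > 0 else math.inf

    return ll, odds_ratio, pct_diff


# Hypothetical counts, not taken from the dataset.
ll, odds_ratio, pct_diff = keyness(100, 1000, 50, 1000)
print(f"LL={ll:.2f}  OR={odds_ratio:.2f}  %DIFF={pct_diff:.1f}")
```

Values recomputed this way may differ from those in the TSV files if the extraction applied smoothing or a different definition of corpus size.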
dc.language.iso	eng
dc.publisher	Faculty of Arts, University of Ljubljana
dc.publisher	Jožef Stefan Institute
dc.relation.isreferencedby	https://doi.org/10.1515/cllt-2025-0046
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.rights.label	PUB
dc.source.uri	https://spot.ff.uni-lj.si/
dc.subject	dependency trees
dc.subject	dependency treebank
dc.subject	keyword extraction
dc.subject	corpus linguistics
dc.subject	written corpus
dc.subject	spoken corpus
dc.title	Syntactic Tree Inventories from English GUM UD Corpus (v2.15)
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor	ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count	6
files.size	44445314