
 
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2025-05-27T14:56:27Z
dc.date.available 2025-05-27T14:56:27Z
dc.date.issued 2025-05-26
dc.identifier.uri http://hdl.handle.net/11356/2036
dc.description This dataset contains lists of delexicalized dependency trees and subtrees extracted from the English UD GUM corpus, version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic inventory of syntactic structures in English, supporting data-driven investigations into syntactic patterns and their variation across modalities.
The GUM corpus was divided into spoken and written subsets based on the original genre classifications. The spoken subset includes interviews, conversations, podcasts, vlogs, courtroom transcripts, and speeches, while the written subset includes news articles, academic texts, fiction, how-to guides, biographies, essays, letters, textbooks, and travel guides.
Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN). For each of the two subcorpora (spoken and written), structures were extracted in three versions:
(1) The full version
(2) A version excluding punctuation (i.e., branches labeled as punct)
(3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)
The extracted structures are provided in tabular TSV format. Each row contains:
* The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
* Its absolute and relative frequency in the target corpus (e.g., GUM-spoken)
* An example (e.g., nice <amod example)
* Frequency in the corresponding reference corpus (e.g., GUM-written)
* Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)
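The keyness measures named in the description (LL, Odds Ratio, %DIFF) can be illustrated with a minimal sketch using their commonly cited textbook formulas. This is an assumption-laden illustration, not the STARK implementation: the function name `keyness`, its parameters, and the choice of what counts as "corpus size" are hypothetical, and the values shipped in the TSV files may be computed or rounded differently.

```python
import math


def keyness(freq_target, size_target, freq_ref, size_ref):
    """Return (log-likelihood, odds ratio, %DIFF) for one structure.

    freq_* are absolute frequencies of the structure; size_* are the total
    counts used as corpus size (hypothetical: the record does not say which
    unit the published figures are normalised by).
    """
    a, b, c, d = freq_target, freq_ref, size_target, size_ref

    # Expected frequencies if the structure were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    # Log-likelihood; a zero count contributes nothing (0 * ln 0 is taken as 0).
    ll = 2 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)

    # Odds ratio: (a / (c - a)) / (b / (d - b)); infinite if the denominator is 0.
    denom = b * (c - a)
    odds_ratio = (a * (d - b)) / denom if denom else math.inf

    # %DIFF: percentage difference between the two relative frequencies.
    rel_t, rel_r = a / c, b / d
    pct_diff = (rel_t - rel_r) * 100 / rel_r if rel_r else math.inf

    return ll, odds_ratio, pct_diff


# Illustrative numbers only, not taken from the dataset.
ll, odds_ratio, pct_diff = keyness(20, 100, 10, 100)
print(round(ll, 3), odds_ratio, round(pct_diff, 1))
```

A structure that is absent from the reference corpus yields an infinite odds ratio and %DIFF in this sketch; tools often smooth such zero counts instead, so the published columns should be treated as authoritative.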
dc.language.iso eng
dc.publisher Faculty of Arts, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.source.uri https://spot.ff.uni-lj.si/
dc.subject dependency trees
dc.subject dependency treebank
dc.subject keyword extraction
dc.subject corpus linguistics
dc.subject written corpus
dc.subject spoken corpus
dc.title Syntactic Tree Inventories from English GUM UD Corpus (v2.15)
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) Z6-4617 Treebank-Driven Approach to the Study of Spoken Slovenian nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count 6
files.size 44445314


 Files in this item

 Download all files in item (42.39 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name                           Size     Format   Description                           MD5
GUM-written-2.15_all.tsv       9.84 MB  Unknown  Written trees (all)                   77ab28c0af43f63ea312e72735ee0ac5
GUM-written-2.15_no-punct.tsv  8.69 MB  Unknown  Written trees excluding punctuation   03d655850c94f991e7d3991d03192348
GUM-written-2.15_no-disfl.tsv  8.67 MB  Unknown  Written trees excluding disfluencies  80cb17edeed0769c4745b01bd76aad7f
GUM-spoken-2.15_all.tsv        5.55 MB  Unknown  Spoken trees (all)                    0ec02778a43358d1a5ceb6b18a1165f8
GUM-spoken-2.15_no-punct.tsv   4.93 MB  Unknown  Spoken trees excluding punctuation    67d1258578bd10122ca765d1e3c285a8
GUM-spoken-2.15_no-disfl.tsv   4.7 MB   Unknown  Spoken trees excluding disfluencies   c1f4e5f8e92eb333ecc6151adbfc8de9
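The MD5 values listed for each file can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the dictionary reproduces only the three spoken-subset checksums from this record as an example.

```python
import hashlib

# Checksums copied from this record (spoken subset only, for brevity).
EXPECTED_MD5 = {
    "GUM-spoken-2.15_all.tsv": "0ec02778a43358d1a5ceb6b18a1165f8",
    "GUM-spoken-2.15_no-punct.tsv": "67d1258578bd10122ca765d1e3c285a8",
    "GUM-spoken-2.15_no-disfl.tsv": "c1f4e5f8e92eb333ecc6151adbfc8de9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, name):
    """Return True if the file at `path` matches the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For instance, `verify_download("GUM-spoken-2.15_all.tsv", "GUM-spoken-2.15_all.tsv")` would return `True` for an intact copy saved under its original name in the working directory.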
