Frequency lists of syntactic structures from the Učbeniki 1.0 corpus

Name: Frequency lists of syntactic structures from the Učbeniki 1.0 corpus
License: https://creativecommons.org/licenses/by-sa/4.0/

Munda, Tina; Arhar Holdt, Špela; Dobrovoljc, Kaja; Kosem, Iztok; Pori, Eva; Krek, Simon

dc.contributor.author	Munda, Tina
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Kosem, Iztok
dc.contributor.author	Pori, Eva
dc.contributor.author	Krek, Simon
dc.date.accessioned	2025-01-31T16:06:30Z
dc.date.available	2025-01-31T16:06:30Z
dc.date.issued	2025-01-30
dc.identifier.uri	http://hdl.handle.net/11356/2010
dc.description	The frequency lists of syntactic structures from the Slovene textbook corpus Učbeniki 1.0 were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: the phrase level (see folder "besednozvezne") and the sentence level (see folder "medstavcne"). At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), and abbreviation (okrajšava). No results were returned for interjection (medmet) or residual (neuvrščeno). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former. At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisnik), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies). The dataset can be used for syntactic analyses in combination with comparable data (http://hdl.handle.net/11356/2009) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), with the present data representing the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "ucbeniki_*_default.tsv" - the original output, containing the extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, sorted by frequency, followed by additional data on the syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "ucbeniki_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "ucbeniki_*_default_tree-description.tsv" - an extension of the "ucbeniki_*_default.tsv" file that includes a verbal description of the syntactic structures (trees).
- "ucbeniki_*_all-examples_tree-description.tsv" - an extension of the "ucbeniki_*_all-examples.tsv" file that includes a verbal description of the syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner: The individual files of Slovene school textbooks were merged into a single CoNLL-U file. The corpus was already linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of the MULTEXT-East v6 morphosyntax, JOS-SYN dependency syntax, and UD part-of-speech and syntactic relations annotations. Furthermore, the original corpus was preprocessed to reduce the MSD tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, while still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.)
Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Next, all output files were enhanced with verbal descriptions of the extracted structures. Lastly, the extended versions of the two original output files ("ucbeniki_*_default_tree-description.tsv", "ucbeniki_*_all-examples_tree-description.tsv") were converted into Excel spreadsheets. The package also includes a configuration file for each level: "config_ucbeniki_besednozvezne.ini" for phrase-level structures, and "config_ucbeniki_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt".
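The preprocessing step described above (reducing each MSD tag to its first letter) could be sketched roughly as follows. This is only an illustration, not the authors' script. It assumes that the MSD tag sits in the XPOS (fifth) column of the CoNLL-U file, and it keeps the full tag under a hypothetical "FullMSD" key in the MISC column, since the record does not say where the original pipeline preserved it.

```python
def reduce_msd(msd: str) -> str:
    """Return the part-of-speech letter (first character) of an MSD tag."""
    return msd[:1]


def reduce_conllu_line(line: str) -> str:
    """Reduce the XPOS column of one CoNLL-U token line to its first letter.

    Comment lines, blank sentence separators and malformed lines are
    returned unchanged.
    ASSUMPTION: the MSD tag is stored in XPOS (column 5), and the full tag
    is preserved under a hypothetical `FullMSD=` key in MISC (column 10).
    """
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line
    full_tag = cols[4]
    if full_tag != "_":
        cols[4] = reduce_msd(full_tag)
        extra = f"FullMSD={full_tag}"
        cols[9] = extra if cols[9] == "_" else f"{cols[9]}|{extra}"
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline


if __name__ == "__main__":
    token = "1\tučbenik\tučbenik\tNOUN\tSomei\t_\t0\troot\t_\t_"
    print(reduce_conllu_line(token))
```

Applied to every line of the merged CoNLL-U file, a function like this would leave sentence boundaries and comments intact while exposing only the part-of-speech letter to the extraction tool.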
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher	Faculty of Arts, University of Ljubljana
dc.relation.isreferencedby	https://zenodo.org/records/13936442
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/prop/en/
dc.subject	textbook corpus
dc.subject	pedagogic corpus
dc.subject	student reading
dc.subject	syntactic data
dc.subject	syntactic structures
dc.title	Frequency lists of syntactic structures from the Učbeniki 1.0 corpus
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Tina Munda tina.munda@cjvt.si CJVT UL
sponsor	ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	4829536 entries
files.count	1
files.size	831460397

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: ucbeniki1.0_skladenjske-strukture.zip
Size: 792.94 MB
Format: application/zip
Description: Syntactic structures from Učbeniki 1.0
MD5: 0a70a3ea8db5cc4a50d1f881e7ba13eb
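
After downloading, the archive can be checked against the MD5 value listed above. The following is a minimal sketch using only the Python standard library. It assumes the archive was saved in the current directory under the file name given in this record.

```python
import hashlib
from pathlib import Path

# Checksum and file name as listed in this record.
EXPECTED_MD5 = "0a70a3ea8db5cc4a50d1f881e7ba13eb"
ARCHIVE_NAME = "ucbeniki1.0_skladenjske-strukture.zip"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path(ARCHIVE_NAME)  # assumed download location
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{ARCHIVE_NAME} not found in the current directory")
```

Reading in chunks keeps memory use low, which matters for an archive of this size.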


ucbeniki1.0_skladenjske-strukture
- medstavcne
  - soredje
    - ucbeniki_soredje_all-examples_tree-description.tsv
    - ucbeniki_soredje_default_tree-description.xlsx
    - ucbeniki_soredje_default_tree-description.tsv
    - ucbeniki_soredje_default.tsv
    - ucbeniki_soredje_all-examples_tree-description.xlsx
    - ucbeniki_soredje_all-examples.tsv
  - podredje
    - osebkov-odv
      - ucbeniki_podredje-osebkov-odv_default.tsv
      - ucbeniki_podredje-osebkov-odv_default_tree-description.xlsx
      - ucbeniki_podredje-osebkov-odv_all-examples_tree-description.tsv
      - ucbeniki_podredje-osebkov-odv_default_tree-description.tsv
      - ucbeniki_podredje-osebkov-odv_all-examples_tree-description.xlsx
      - ucbeniki_podredje-osebkov-odv_all-examples.tsv
    - prislovni-odv
      - ucbeniki_podredje-prislovni-odv_default_tree-description.tsv
      - ucbeniki_podredje-prislovni-odv_all-examples_tree-description.tsv
      - ucbeniki_podredje-prislovni-odv_default_tree-description.xlsx
      - ucbeniki_podredje-prislovni-odv_all-examples_tree-description.xlsx
      - ucbeniki_podredje-prislovni-odv_all-examples.tsv
      - ucbeniki_podredje-prislovni-odv_default.tsv
    - prilastkov-odv
      - ucbeniki_podredje-prilastkov-odv_default_tree-description.tsv
      - ucbeniki_podredje-prilastkov-odv_all-examples.tsv
      - ucbeniki_podredje-prilastkov-odv_default.tsv
      - ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.xlsx
      - ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.tsv
      - ucbeniki_podredje-prilastkov-odv_default_tree-description.xlsx
    - predmetni-odv
      - ucbeniki_podredje-predmetni-odv_all-examples.tsv
      - ucbeniki_podredje-predmetni-odv_all-examples_tree-description.tsv
      - ucbeniki_podredje-predmetni-odv_default_tree-description.xlsx
      - ucbeniki_podredje-predmetni-odv_all-examples_tree-description.xlsx
      - ucbeniki_podredje-predmetni-odv_default_tree-description.tsv
      - ucbeniki_podredje-predmetni-odv_default.tsv
  - config_ucbeniki_medstavcne.ini
  - priredje
    - ucbeniki_priredje_default_tree-description.xlsx
    - ucbeniki_priredje_all-examples_tree-description.tsv
    - ucbeniki_priredje_all-examples_tree-description.xlsx
    - ucbeniki_priredje_default_tree-description.tsv
    - ucbeniki_priredje_default.tsv
    - ucbeniki_priredje_all-examples.tsv
- besednozvezne
  - medmet
    - ucbeniki_medmet_all-examples_tree-description.xlsx
    - ucbeniki_medmet_all-examples.tsv
    - ucbeniki_medmet_default_tree-description.tsv
    - ucbeniki_medmet_default.tsv
    - ucbeniki_medmet_all-examples_tree-description.tsv
    - ucbeniki_medmet_default_tree-description.xlsx
  - samostalnik
    - ucbeniki_samostalnik_all-examples_tree-description.xlsx
    - ucbeniki_samostalnik_default.tsv
    - ucbeniki_samostalnik_all-examples_tree-description.tsv
    - ucbeniki_samostalnik_default_tree-description.xlsx
    - ucbeniki_samostalnik_default_tree-description.tsv
    - ucbeniki_samostalnik_all-examples.tsv
  - predlog
    - ucbeniki_predlog_default_tree-description.xlsx
    - ucbeniki_predlog_default_tree-description.tsv
    - ucbeniki_predlog_all-examples_tree-description.tsv
    - ucbeniki_predlog_all-examples_tree-description.xlsx
    - ucbeniki_predlog_default.tsv
    - ucbeniki_predlog_all-examples.tsv
  - stevnik
    - ucbeniki_stevnik_default.tsv
    - ucbeniki_stevnik_all-examples_tree-description.xlsx
    - ucbeniki_stevnik_default_tree-description.xlsx
    - ucbeniki_stevnik_all-examples.tsv
    - ucbeniki_stevnik_default_tree-description.tsv
    - ucbeniki_stevnik_all-examples_tree-description.tsv
  - prislov
    - ucbeniki_prislov_default_tree-description.tsv
    - ucbeniki_prislov_default.tsv
    - ucbeniki_prislov_all-examples.tsv
    - ucbeniki_prislov_all-examples_tree-description.xlsx
    - ucbeniki_prislov_all-examples_tree-description.tsv
    - ucbeniki_prislov_default_tree-description.xlsx
  - veznik
    - ucbeniki_veznik_all-examples.tsv
    - ucbeniki_veznik_all-examples_tree-description.tsv
    - ucbeniki_veznik_default_tree-description.xlsx
    - ucbeniki_veznik_default.tsv
    - ucbeniki_veznik_default_tree-description.tsv
    - ucbeniki_veznik_all-examples_tree-description.xlsx
  - zaimek
    - ucbeniki_zaimek_all-examples.tsv
    - ucbeniki_zaimek_default.tsv
    - ucbeniki_zaimek_default_tree-description.tsv
    - ucbeniki_zaimek_all-examples_tree-description.tsv
    - ucbeniki_zaimek_default_tree-description.xlsx
    - ucbeniki_zaimek_all-examples_tree-description.xlsx
  - clenek
    - ucbeniki_clenek_default_tree-description.tsv
    - ucbeniki_clenek_all-examples.tsv
    - ucbeniki_clenek_default.tsv
    - ucbeniki_clenek_all-examples_tree-description.xlsx
    - ucbeniki_clenek_all-examples_tree-description.tsv
    - ucbeniki_clenek_default_tree-description.xlsx
  - okrajsava
    - ucbeniki_okrajsava_default.tsv
    - ucbeniki_okrajsava_all-examples_tree-description.xlsx
    - ucbeniki_okrajsava_default_tree-description.xlsx
    - ucbeniki_okrajsava_default_tree-description.tsv
    - ucbeniki_okrajsava_all-examples.tsv
    - ucbeniki_okrajsava_all-examples_tree-description.tsv
  - pridevnik
    - ucbeniki_pridevnik_default_tree-description.xlsx
    - ucbeniki_pridevnik_all-examples_tree-description.tsv
    - ucbeniki_pridevnik_all-examples_tree-description.xlsx
    - ucbeniki_pridevnik_all-examples.tsv
    - ucbeniki_pridevnik_default_tree-description.tsv
    - ucbeniki_pridevnik_default.tsv
  - config_ucbeniki_besednozvezne.ini
  - neuvrsceno
    - ucbeniki_neuvrsceno_all-examples_tree-description.tsv
    - ucbeniki_neuvrsceno_default.tsv
    - ucbeniki_neuvrsceno_all-examples_tree-description.xlsx
    - ucbeniki_neuvrsceno_default_tree-description.tsv
    - ucbeniki_neuvrsceno_all-examples.tsv
    - ucbeniki_neuvrsceno_default_tree-description.xlsx
  - glagol
    - ucbeniki_glagol_all-examples_tree-description.tsv
    - ucbeniki_glagol_default_tree-description.xlsx
    - ucbeniki_glagol_all-examples_tree-description.xlsx
    - ucbeniki_glagol_all-examples.tsv
    - ucbeniki_glagol_default_tree-description.tsv
    - ucbeniki_glagol_default.tsv
- 00README.txt
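
The tree above follows a regular naming pattern, so the paths of the four TSV files for a given category can be generated rather than typed by hand. The sketch below is one way to do this. The function name and its parameters are illustrative and not part of the package, and the two Excel counterparts of the tree-description files are deliberately left out.

```python
from pathlib import PurePosixPath

# Root folder name as shown in the file preview.
ROOT = PurePosixPath("ucbeniki1.0_skladenjske-strukture")

# The four TSV outputs listed in the description, in the same order.
SUFFIXES = (
    "default.tsv",
    "all-examples.tsv",
    "default_tree-description.tsv",
    "all-examples_tree-description.tsv",
)


def structure_files(level, category, subtype=None):
    """Return the four TSV paths for one part of speech or clausal relation.

    level    -- "besednozvezne" (phrase level) or "medstavcne" (sentence level)
    category -- folder name such as "glagol", "soredje" or "podredje"
    subtype  -- subordination subfolder such as "osebkov-odv" (podredje only)
    """
    folder = ROOT / level / category
    name = category
    if subtype is not None:
        folder = folder / subtype
        name = f"{category}-{subtype}"
    return [folder / f"ucbeniki_{name}_{suffix}" for suffix in SUFFIXES]


if __name__ == "__main__":
    for path in structure_files("medstavcne", "podredje", "osebkov-odv"):
        print(path)
```

The returned paths are relative to the directory into which the archive was unpacked.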
