dc.contributor.author Munda, Tina
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Kosem, Iztok
dc.contributor.author Pori, Eva
dc.contributor.author Krek, Simon
dc.date.accessioned 2025-01-31T16:06:30Z
dc.date.available 2025-01-31T16:06:30Z
dc.date.issued 2025-01-30
dc.identifier.uri http://hdl.handle.net/11356/2010
dc.description The frequency lists of syntactic structures from the Slovene textbook corpus Učbeniki 1.0 were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: the phrase level (see folder "besednozvezne") and the sentence level (see folder "medstavcne"). At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), and abbreviation (okrajšava). No results were returned for interjection (medmet) and residual (neuvrščeno). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former. At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisnik), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies). The dataset can be used for syntactic analyses in combination with comparable data (http://hdl.handle.net/11356/2009) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), with the present data representing the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "ucbeniki_*_default.tsv" - the original output, containing the extracted unique syntactic structures of varying lengths (2 to 10 tokens), arranged by frequency, followed by additional data on the syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "ucbeniki_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "ucbeniki_*_default_tree-description.tsv" - an extension of the "ucbeniki_*_default.tsv" file that includes a verbal description of the syntactic structures (trees).
- "ucbeniki_*_all-examples_tree-description.tsv" - an extension of the "ucbeniki_*_all-examples.tsv" file that includes a verbal description of the syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner: The individual files of Slovene school textbooks were merged into a single CONLLU file. The corpus was already linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of the MULTEXT-East v6 morphosyntax, JOS-SYN dependency syntax, and UD part-of-speech and syntactic relations annotations. Furthermore, the original corpus was preprocessed to reduce the MSD tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, while still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.)
Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Another step was to enhance all output files with verbal descriptions of the extracted structures. Lastly, the extended versions of the two original output files ("ucbeniki_*_default_tree-description.tsv", "ucbeniki_*_all-examples_tree-description.tsv") were converted into Excel spreadsheets. The package also includes a configuration file for each level: "config_ucbeniki_besednozvezne.ini" for phrase-level structures, and "config_ucbeniki_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt".
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Faculty of Arts, University of Ljubljana
dc.relation.isreferencedby https://zenodo.org/records/13936442
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/prop/en/
dc.subject textbook corpus
dc.subject pedagogic corpus
dc.subject student reading
dc.subject syntactic data
dc.subject syntactic structures
dc.title Frequency lists of syntactic structures from the Učbeniki 1.0 corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tina Munda tina.munda@cjvt.si CJVT UL
sponsor ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 4829536 entries
files.count 1
files.size 831460397
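
The preprocessing step in the description (reducing the MSD tag to its first letter, e.g. Somei → S) can be illustrated with a minimal sketch. It assumes a standard 10-column CoNLL-U file with the MSD tag in the XPOS column (index 4); the function name, the column choice, and the sample token are illustrative assumptions, not the authors' script. The record does not say how the full MSD tag was kept available for display, so this sketch does not attempt that.

```python
def reduce_msd_to_pos(conllu_text, msd_column=4):
    """Replace the MSD tag on every token line with its first letter."""
    reduced_lines = []
    for line in conllu_text.splitlines():
        # Comment lines and blank sentence separators pass through unchanged.
        if not line.strip() or line.startswith("#"):
            reduced_lines.append(line)
            continue
        columns = line.split("\t")
        # Assumption: 10 tab-separated columns, MSD tag stored in XPOS.
        if len(columns) == 10 and columns[msd_column] not in ("_", ""):
            columns[msd_column] = columns[msd_column][0]
        reduced_lines.append("\t".join(columns))
    return "\n".join(reduced_lines)


# Hypothetical one-token sentence using the tag from the description.
sample = "# text = stol\n1\tstol\tstol\tNOUN\tSomei\t_\t0\troot\t_\t_\n"
print(reduce_msd_to_pos(sample))
```

Lines that do not have exactly 10 columns are left untouched, so malformed input is passed through rather than silently altered.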


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: ucbeniki1.0_skladenjske-strukture.zip
Size: 792.94 MB
Format: application/zip
Description: Syntactic structures from Učbeniki 1.0
MD5: 0a70a3ea8db5cc4a50d1f881e7ba13eb
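
Since the record publishes an MD5 checksum for the archive, a downloaded copy can be checked against it. The following is a minimal sketch using Python's standard `hashlib`; the function names and the local file path are assumptions for illustration.

```python
import hashlib

# Checksum as published in this record.
EXPECTED_MD5 = "0a70a3ea8db5cc4a50d1f881e7ba13eb"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the ~800 MB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("ucbeniki1.0_skladenjske-strukture.zip")` would return `True` for an uncorrupted download saved under that name in the working directory.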
 File Preview  
  • ucbeniki1.0_skladenjske-strukture
    • medstavcne
      • soredje
        • ucbeniki_soredje_all-examples_tree-description.tsv
        • ucbeniki_soredje_default_tree-description.xlsx
        • ucbeniki_soredje_default_tree-description.tsv
        • ucbeniki_soredje_default.tsv
        • ucbeniki_soredje_all-examples_tree-description.xlsx
        • ucbeniki_soredje_all-examples.tsv
      • podredje
        • osebkov-odv
          • ucbeniki_podredje-osebkov-odv_default.tsv
          • ucbeniki_podredje-osebkov-odv_default_tree-description.xlsx
          • ucbeniki_podredje-osebkov-odv_all-examples_tree-description.tsv
          • ucbeniki_podredje-osebkov-odv_default_tree-description.tsv
          • ucbeniki_podredje-osebkov-odv_all-examples_tree-description.xlsx
          • ucbeniki_podredje-osebkov-odv_all-examples.tsv
        • prislovni-odv
          • ucbeniki_podredje-prislovni-odv_default_tree-description.tsv
          • ucbeniki_podredje-prislovni-odv_all-examples_tree-description.tsv
          • ucbeniki_podredje-prislovni-odv_default_tree-description.xlsx
          • ucbeniki_podredje-prislovni-odv_all-examples_tree-description.xlsx
          • ucbeniki_podredje-prislovni-odv_all-examples.tsv
          • ucbeniki_podredje-prislovni-odv_default.tsv
        • prilastkov-odv
          • ucbeniki_podredje-prilastkov-odv_default_tree-description.tsv
          • ucbeniki_podredje-prilastkov-odv_all-examples.tsv
          • ucbeniki_podredje-prilastkov-odv_default.tsv
          • ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.xlsx
          • ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.tsv
          • ucbeniki_podredje-prilastkov-odv_default_tree-description.xlsx
        • predmetni-odv
          • ucbeniki_podredje-predmetni-odv_all-examples.tsv
          • ucbeniki_podredje-predmetni-odv_all-examples_tree-description.tsv
          • ucbeniki_podredje-predmetni-odv_default_tree-description.xlsx
          • ucbeniki_podredje-predmetni-odv_all-examples_tree-description.xlsx
          • ucbeniki_podredje-predmetni-odv_default_tree-description.tsv
          • ucbeniki_podredje-predmetni-odv_default.tsv
      • config_ucbeniki_medstavcne.ini
      • priredje
        • ucbeniki_priredje_default_tree-description.xlsx
        • ucbeniki_priredje_all-examples_tree-description.tsv
        • ucbeniki_priredje_all-examples_tree-description.xlsx
        • ucbeniki_priredje_default_tree-description.tsv
        • ucbeniki_priredje_default.tsv
        • ucbeniki_priredje_all-examples.tsv
    • besednozvezne
      • medmet
        • ucbeniki_medmet_all-examples_tree-description.xlsx
        • ucbeniki_medmet_all-examples.tsv
        • ucbeniki_medmet_default_tree-description.tsv
        • ucbeniki_medmet_default.tsv
        • ucbeniki_medmet_all-examples_tree-description.tsv
        • ucbeniki_medmet_default_tree-description.xlsx
      • samostalnik
        • ucbeniki_samostalnik_all-examples_tree-description.xlsx
        • ucbeniki_samostalnik_default.tsv
        • ucbeniki_samostalnik_all-examples_tree-description.tsv
        • ucbeniki_samostalnik_default_tree-description.xlsx
        • ucbeniki_samostalnik_default_tree-description.tsv
        • ucbeniki_samostalnik_all-examples.tsv
      • predlog
        • ucbeniki_predlog_default_tree-description.xlsx
        • ucbeniki_predlog_default_tree-description.tsv
        • ucbeniki_predlog_all-examples_tree-description.tsv
        • ucbeniki_predlog_all-examples_tree-description.xlsx
        • ucbeniki_predlog_default.tsv
        • ucbeniki_predlog_all-examples.tsv
      • stevnik
        • ucbeniki_stevnik_default.tsv
        • ucbeniki_stevnik_all-examples_tree-description.xlsx
        • ucbeniki_stevnik_default_tree-description.xlsx
        • ucbeniki_stevnik_all-examples.tsv
        • ucbeniki_stevnik_default_tree-description.tsv
        • ucbeniki_stevnik_all-examples_tree-description.tsv
      • prislov
        • ucbeniki_prislov_default_tree-description.tsv
        • ucbeniki_prislov_default.tsv
        • ucbeniki_prislov_all-examples.tsv
        • ucbeniki_prislov_all-examples_tree-description.xlsx
        • ucbeniki_prislov_all-examples_tree-description.tsv
        • ucbeniki_prislov_default_tree-description.xlsx
      • veznik
        • ucbeniki_veznik_all-examples.tsv
        • ucbeniki_veznik_all-examples_tree-description.tsv
        • ucbeniki_veznik_default_tree-description.xlsx
        • ucbeniki_veznik_default.tsv
        • ucbeniki_veznik_default_tree-description.tsv
        • ucbeniki_veznik_all-examples_tree-description.xlsx
      • zaimek
        • ucbeniki_zaimek_all-examples.tsv
        • ucbeniki_zaimek_default.tsv
        • ucbeniki_zaimek_default_tree-description.tsv
        • ucbeniki_zaimek_all-examples_tree-description.tsv
        • ucbeniki_zaimek_default_tree-description.xlsx
        • ucbeniki_zaimek_all-examples_tree-description.xlsx
      • clenek
        • ucbeniki_clenek_default_tree-description.tsv
        • ucbeniki_clenek_all-examples.tsv
        • ucbeniki_clenek_default.tsv
        • ucbeniki_clenek_all-examples_tree-description.xlsx
        • ucbeniki_clenek_all-examples_tree-description.tsv
        • ucbeniki_clenek_default_tree-description.xlsx
      • okrajsava
        • ucbeniki_okrajsava_default.tsv
        • ucbeniki_okrajsava_all-examples_tree-description.xlsx
        • ucbeniki_okrajsava_default_tree-description.xlsx
        • ucbeniki_okrajsava_default_tree-description.tsv
        • ucbeniki_okrajsava_all-examples.tsv
        • ucbeniki_okrajsava_all-examples_tree-description.tsv
      • pridevnik
        • ucbeniki_pridevnik_default_tree-description.xlsx
        • ucbeniki_pridevnik_all-examples_tree-description.tsv
        • ucbeniki_pridevnik_all-examples_tree-description.xlsx
        • ucbeniki_pridevnik_all-examples.tsv
        • ucbeniki_pridevnik_default_tree-description.tsv
        • ucbeniki_pridevnik_default.tsv
      • config_ucbeniki_besednozvezne.ini
      • neuvrsceno
        • ucbeniki_neuvrsceno_all-examples_tree-description.tsv
        • ucbeniki_neuvrsceno_default.tsv
        • ucbeniki_neuvrsceno_all-examples_tree-description.xlsx
        • ucbeniki_neuvrsceno_default_tree-description.tsv
        • ucbeniki_neuvrsceno_all-examples.tsv
        • ucbeniki_neuvrsceno_default_tree-description.xlsx
      • glagol
        • ucbeniki_glagol_all-examples_tree-description.tsv
        • ucbeniki_glagol_default_tree-description.xlsx
        • ucbeniki_glagol_all-examples_tree-description.xlsx
        • ucbeniki_glagol_all-examples.tsv
        • ucbeniki_glagol_default_tree-description.tsv
        • ucbeniki_glagol_default.tsv
    • 00README.txt
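
Each folder in the listing above holds the same six files, which follow the naming scheme given in the description, with the folder label taking the place of the asterisk. A minimal sketch that builds the expected file names for one label is shown below; the function and constant names are illustrative assumptions, and the six suffixes are taken from the listing.

```python
# File-name suffixes as they appear in the listing: four TSV files
# plus Excel versions of the two tree-description files.
VARIANTS = (
    "default.tsv",
    "all-examples.tsv",
    "default_tree-description.tsv",
    "all-examples_tree-description.tsv",
    "default_tree-description.xlsx",
    "all-examples_tree-description.xlsx",
)


def expected_files(label):
    """Return the six file names expected for one part of speech or relation."""
    return ["ucbeniki_{}_{}".format(label, variant) for variant in VARIANTS]


# Labels as used in the listing, e.g. "glagol" or "podredje-osebkov-odv".
for name in expected_files("glagol"):
    print(name)
```

Comparing this list with the contents of an unpacked folder is a quick way to confirm that nothing is missing after extraction.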
