dc.contributor.author | Munda, Tina |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2025-01-31T16:06:30Z |
dc.date.available | 2025-01-31T16:06:30Z |
dc.date.issued | 2025-01-30 |
dc.identifier.uri | http://hdl.handle.net/11356/2010 |
dc.description | The frequency lists of syntactic structures from the Slovene textbook corpus Učbeniki 1.0 were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne"). At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), and abbreviation (okrajšava); no results were returned for interjection (medmet) and residual (neuvrščeno). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former. At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisnik), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).
The dataset can be used for syntactic analyses in combination with comparable data (http://hdl.handle.net/11356/2009) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), the present data representing the expected or desired scope of reception. For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "ucbeniki_*_default.tsv" - the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "ucbeniki_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "ucbeniki_*_default_tree-description.tsv" - an extension of the "ucbeniki_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "ucbeniki_*_all-examples_tree-description.tsv" - an extension of the "ucbeniki_*_all-examples.tsv" file that includes a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.) The data was prepared in the following manner: The individual files of Slovene school textbooks were merged into a single CONLLU file. The corpus was already linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of the MULTEXT-East v6 morphosyntax, JOS-SYN dependency syntax, and UD part-of-speech and syntactic relations annotations. Furthermore, the original corpus was preprocessed to reduce the MSD tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features).
This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.) Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Another step was to enhance all output files with verbal descriptions of the extracted structures. Lastly, the extended versions of the two original output files ("ucbeniki_*_default_tree-description.tsv", "ucbeniki_*_all-examples_tree-description.tsv") were converted into Excel spreadsheets. The package also includes a configuration file for each level: "config_ucbeniki_besednozvezne.ini" for phrase-level structures, and "config_ucbeniki_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt". |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.relation.isreferencedby | https://zenodo.org/records/13936442 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/en/ |
dc.subject | textbook corpus |
dc.subject | pedagogic corpus |
dc.subject | student reading |
dc.subject | syntactic data |
dc.subject | syntactic structures |
dc.title | Frequency lists of syntactic structures from the Učbeniki 1.0 corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tina Munda tina.munda@cjvt.si CJVT UL |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 4829536 entries |
files.count | 1 |
files.size | 831460397 |
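
The preprocessing step mentioned in dc.description (reducing each MSD tag to its first letter, e.g. Somei → S) could be sketched roughly as follows. This is only an illustration, not the authors' script: it assumes that the MSD tag sits in the XPOS column (the fifth column) of each CoNLL-U token line, and the function names `reduce_msd` and `reduce_msd_lines` are hypothetical. The record does not describe how the full MSD tags were kept available for display, so the sketch does not attempt that part.

```python
from typing import Iterable, Iterator

XPOS_INDEX = 4  # assumption: the MSD tag is stored in the CoNLL-U XPOS column


def reduce_msd(line: str) -> str:
    """Return a CoNLL-U line with the XPOS value cut to its first letter.

    Comment lines, blank lines, and lines without 10 tab-separated
    columns are returned unchanged.
    """
    stripped = line.rstrip("\n")
    if not stripped.strip() or stripped.startswith("#"):
        return stripped
    cols = stripped.split("\t")
    if len(cols) != 10:
        return stripped
    if cols[XPOS_INDEX] != "_":
        cols[XPOS_INDEX] = cols[XPOS_INDEX][:1]  # e.g. "Somei" -> "S"
    return "\t".join(cols)


def reduce_msd_lines(lines: Iterable[str]) -> Iterator[str]:
    """Apply reduce_msd to every line of a CoNLL-U file."""
    for line in lines:
        yield reduce_msd(line)
```

Keeping the transformation line-based means it can be streamed over a large merged CoNLL-U file without loading it into memory.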
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

- Name: ucbeniki1.0_skladenjske-strukture.zip
- Size: 792.94 MB
- Format: application/zip
- Description: Syntactic structures from Učbeniki 1.0
- MD5: 0a70a3ea8db5cc4a50d1f881e7ba13eb
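
The MD5 value listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `matches_expected` are hypothetical, and the local file path is whatever the download was saved as.

```python
import hashlib

# Checksum as listed in this record for ucbeniki1.0_skladenjske-strukture.zip
EXPECTED_MD5 = "0a70a3ea8db5cc4a50d1f881e7ba13eb"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()

# Example call (path is an assumption about where the file was saved):
# matches_expected("ucbeniki1.0_skladenjske-strukture.zip")
```

Reading in chunks keeps memory use low, which matters for an archive of roughly 800 MB.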
- ucbeniki1.0_skladenjske-strukture
  - medstavcne
    - soredje
      - ucbeniki_soredje_all-examples_tree-description.tsv
      - ucbeniki_soredje_default_tree-description.xlsx
      - ucbeniki_soredje_default_tree-description.tsv
      - ucbeniki_soredje_default.tsv
      - ucbeniki_soredje_all-examples_tree-description.xlsx
      - ucbeniki_soredje_all-examples.tsv
    - podredje
      - osebkov-odv
        - ucbeniki_podredje-osebkov-odv_default.tsv
        - ucbeniki_podredje-osebkov-odv_default_tree-description.xlsx
        - ucbeniki_podredje-osebkov-odv_all-examples_tree-description.tsv
        - ucbeniki_podredje-osebkov-odv_default_tree-description.tsv
        - ucbeniki_podredje-osebkov-odv_all-examples_tree-description.xlsx
        - ucbeniki_podredje-osebkov-odv_all-examples.tsv
      - prislovni-odv
        - ucbeniki_podredje-prislovni-odv_default_tree-description.tsv
        - ucbeniki_podredje-prislovni-odv_all-examples_tree-description.tsv
        - ucbeniki_podredje-prislovni-odv_default_tree-description.xlsx
        - ucbeniki_podredje-prislovni-odv_all-examples_tree-description.xlsx
        - ucbeniki_podredje-prislovni-odv_all-examples.tsv
        - ucbeniki_podredje-prislovni-odv_default.tsv
      - prilastkov-odv
        - ucbeniki_podredje-prilastkov-odv_default_tree-description.tsv
        - ucbeniki_podredje-prilastkov-odv_all-examples.tsv
        - ucbeniki_podredje-prilastkov-odv_default.tsv
        - ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.xlsx
        - ucbeniki_podredje-prilastkov-odv_all-examples_tree-description.tsv
        - ucbeniki_podredje-prilastkov-odv_default_tree-description.xlsx
      - predmetni-odv
        - ucbeniki_podredje-predmetni-odv_all-examples.tsv
        - ucbeniki_podredje-predmetni-odv_all-examples_tree-description.tsv
        - ucbeniki_podredje-predmetni-odv_default_tree-description.xlsx
        - ucbeniki_podredje-predmetni-odv_all-examples_tree-description.xlsx
        - ucbeniki_podredje-predmetni-odv_default_tree-description.tsv
        - ucbeniki_podredje-predmetni-odv_default.tsv
    - config_ucbeniki_medstavcne.ini
    - priredje
      - ucbeniki_priredje_default_tree-description.xlsx
      - ucbeniki_priredje_all-examples_tree-description.tsv
      - ucbeniki_priredje_all-examples_tree-description.xlsx
      - ucbeniki_priredje_default_tree-description.tsv
      - ucbeniki_priredje_default.tsv
      - ucbeniki_priredje_all-examples.tsv
  - besednozvezne
    - medmet
      - ucbeniki_medmet_all-examples_tree-description.xlsx
      - ucbeniki_medmet_all-examples.tsv
      - ucbeniki_medmet_default_tree-description.tsv
      - ucbeniki_medmet_default.tsv
      - ucbeniki_medmet_all-examples_tree-description.tsv
      - ucbeniki_medmet_default_tree-description.xlsx
    - samostalnik
      - ucbeniki_samostalnik_all-examples_tree-description.xlsx
      - ucbeniki_samostalnik_default.tsv
      - ucbeniki_samostalnik_all-examples_tree-description.tsv
      - ucbeniki_samostalnik_default_tree-description.xlsx
      - ucbeniki_samostalnik_default_tree-description.tsv
      - ucbeniki_samostalnik_all-examples.tsv
    - predlog
      - ucbeniki_predlog_default_tree-description.xlsx
      - ucbeniki_predlog_default_tree-description.tsv
      - ucbeniki_predlog_all-examples_tree-description.tsv
      - ucbeniki_predlog_all-examples_tree-description.xlsx
      - ucbeniki_predlog_default.tsv
      - ucbeniki_predlog_all-examples.tsv
    - stevnik
      - ucbeniki_stevnik_default.tsv
      - ucbeniki_stevnik_all-examples_tree-description.xlsx
      - ucbeniki_stevnik_default_tree-description.xlsx
      - ucbeniki_stevnik_all-examples.tsv
      - ucbeniki_stevnik_default_tree-description.tsv
      - ucbeniki_stevnik_all-examples_tree-description.tsv
    - prislov
      - ucbeniki_prislov_default_tree-description.tsv
      - ucbeniki_prislov_default.tsv
      - ucbeniki_prislov_all-examples.tsv
      - ucbeniki_prislov_all-examples_tree-description.xlsx
      - ucbeniki_prislov_all-examples_tree-description.tsv
      - ucbeniki_prislov_default_tree-description.xlsx
    - veznik
      - ucbeniki_veznik_all-examples.tsv
      - ucbeniki_veznik_all-examples_tree-description.tsv
      - ucbeniki_veznik_default_tree-description.xlsx
      - ucbeniki_veznik_default.tsv
      - ucbeniki_veznik_default_tree-description.tsv
      - ucbeniki_veznik_all-examples_tree-description.xlsx
    - zaimek
      - ucbeniki_zaimek_all-examples.tsv
      - ucbeniki_zaimek_default.tsv
      - ucbeniki_zaimek_default_tree-description.tsv
      - ucbeniki_zaimek_all-examples_tree-description.tsv
      - ucbeniki_zaimek_default_tree-description.xlsx
      - ucbeniki_zaimek_all-examples_tree-description.xlsx
    - clenek
      - ucbeniki_clenek_default_tree-description.tsv
      - ucbeniki_clenek_all-examples.tsv
      - ucbeniki_clenek_default.tsv
      - ucbeniki_clenek_all-examples_tree-description.xlsx
      - ucbeniki_clenek_all-examples_tree-description.tsv
      - ucbeniki_clenek_default_tree-description.xlsx
    - okrajsava
      - ucbeniki_okrajsava_default.tsv
      - ucbeniki_okrajsava_all-examples_tree-description.xlsx
      - ucbeniki_okrajsava_default_tree-description.xlsx
      - ucbeniki_okrajsava_default_tree-description.tsv
      - ucbeniki_okrajsava_all-examples.tsv
      - ucbeniki_okrajsava_all-examples_tree-description.tsv
    - pridevnik
      - ucbeniki_pridevnik_default_tree-description.xlsx
      - ucbeniki_pridevnik_all-examples_tree-description.tsv
      - ucbeniki_pridevnik_all-examples_tree-description.xlsx
      - ucbeniki_pridevnik_all-examples.tsv
      - ucbeniki_pridevnik_default_tree-description.tsv
      - ucbeniki_pridevnik_default.tsv
    - config_ucbeniki_besednozvezne.ini
    - neuvrsceno
      - ucbeniki_neuvrsceno_all-examples_tree-description.tsv
      - ucbeniki_neuvrsceno_default.tsv
      - ucbeniki_neuvrsceno_all-examples_tree-description.xlsx
      - ucbeniki_neuvrsceno_default_tree-description.tsv
      - ucbeniki_neuvrsceno_all-examples.tsv
      - ucbeniki_neuvrsceno_default_tree-description.xlsx
    - glagol
      - ucbeniki_glagol_all-examples_tree-description.tsv
      - ucbeniki_glagol_default_tree-description.xlsx
      - ucbeniki_glagol_all-examples_tree-description.xlsx
      - ucbeniki_glagol_all-examples.tsv
      - ucbeniki_glagol_default_tree-description.tsv
      - ucbeniki_glagol_default.tsv
  - 00README.txt
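
The naming pattern visible in the listing above ("ucbeniki_*_<variant>" inside a folder per part of speech or clausal relation) can be expressed as a small helper, for example to locate files after unpacking the archive. This is a sketch derived only from the file names shown in this record; the function name `expected_files` and the constants are hypothetical and not part of the package.

```python
from pathlib import PurePosixPath
from typing import List, Optional

ROOT = "ucbeniki1.0_skladenjske-strukture"

# The six file variants that appear in each folder of the listing
VARIANTS = (
    "default.tsv",
    "all-examples.tsv",
    "default_tree-description.tsv",
    "all-examples_tree-description.tsv",
    "default_tree-description.xlsx",
    "all-examples_tree-description.xlsx",
)


def expected_files(level: str, category: str,
                   subtype: Optional[str] = None) -> List[str]:
    """Build the relative paths of the files for one category.

    level    -- "besednozvezne" (phrase level) or "medstavcne" (sentence level)
    category -- e.g. "samostalnik", "priredje", "podredje"
    subtype  -- only for "podredje", e.g. "osebkov-odv"
    """
    folder = PurePosixPath(ROOT, level, category)
    label = category
    if subtype is not None:
        folder = folder / subtype
        label = f"{category}-{subtype}"
    return [str(folder / f"ucbeniki_{label}_{variant}") for variant in VARIANTS]
```

For instance, `expected_files("medstavcne", "podredje", "osebkov-odv")` yields the six paths under "medstavcne/podredje/osebkov-odv", starting with "ucbeniki_podredje-osebkov-odv_default.tsv".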