dc.contributor.author | Munda, Tina |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Rozman, Tadeja |
dc.contributor.author | Stritar Kučuk, Mojca |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Krapš Vodopivec, Irena |
dc.contributor.author | Stabej, Marko |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Goli, Teja |
dc.contributor.author | Lavrič, Polona |
dc.contributor.author | Laskowski, Cyprian |
dc.contributor.author | Kocjančič, Polonca |
dc.contributor.author | Klemenc, Bojan |
dc.contributor.author | Krsnik, Luka |
dc.contributor.author | Kosem, Iztok |
dc.date.accessioned | 2025-01-31T16:06:11Z |
dc.date.available | 2025-01-31T16:06:11Z |
dc.date.issued | 2025-01-30 |
dc.identifier.uri | http://hdl.handle.net/11356/2009 |
dc.description | The frequency lists of syntactic structures from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne"). At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), and abbreviation (okrajšava); no results were returned for interjection (medmet) and residual (neuvrščeno). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former. At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisniki), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies). 
The dataset can be used for syntactic analyses of Slovene writing produced in (Slovene) schools, also in combination with comparable data (http://hdl.handle.net/11356/2010) from the Slovene textbook corpus Učbeniki 1.0, which represents the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "solar-orig_*_default.tsv": the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "solar-orig_*_all-examples.tsv": the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "solar-orig_*_default_tree-description.tsv": an extension of the "solar-orig_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "solar-orig_*_all-examples_metadata_tree-description.tsv": an extension of the "solar-orig_*_all-examples.tsv" file that includes school text metadata and a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner: First, the corpus was linguistically annotated with the CLASSLA v2.1 pipeline (https://github.com/clarinsi/classla/) at the levels of UD part-of-speech and syntactic relations annotations to enable the extraction of sentence-level structures. Furthermore, the original corpus containing MULTEXT-East tags (MSD tags) was preprocessed to reduce the tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). 
This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.) Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Another step was to enhance all output files with verbal descriptions of the extracted structures and to enrich all "solar-orig_*_all-examples.tsv" files with school text metadata by assigning metadata from "solar-meta.tsv" (see "Solar.CoNLL-U.zip" in http://hdl.handle.net/11356/1589) to each structure based on matching text IDs; both steps were carried out with Python. Lastly, the extended versions of the two original output files ("solar-orig_*_default_tree-description.tsv", "solar-orig_*_all-examples_metadata_tree-description.tsv") were converted into Excel spreadsheets. The package also includes a configuration file for each level: "config_solar_besednozvezne.ini" for phrase-level structures, and "config_solar_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt". |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.relation.isreferencedby | https://zenodo.org/records/13936442 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/en/ |
dc.subject | developmental corpus |
dc.subject | student writing |
dc.subject | syntactic data |
dc.subject | syntactic structures |
dc.title | Frequency lists of syntactic structures from the Šolar 3.0 corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tina Munda tina.munda@cjvt.si CJVT UL |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 1850324 entries |
files.count | 1 |
files.size | 345901438 |
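The MSD preprocessing step mentioned in the description (reducing a tag such as Somei to its first letter, S) could be sketched roughly as follows. This is only an illustration and not the authors' script: it assumes that the full MSD tag sits in the XPOS column (fifth field) of a 10-column CoNLL-U line and that the reduced letter is written into the UPOS column (fourth field), so the full tag stays available. The record does not state which column was actually rewritten, and the function names and the sample line are hypothetical.

```python
def reduce_msd(tag: str) -> str:
    """Return the first letter of an MSD tag, i.e. its part-of-speech category."""
    return tag[:1]


def reduce_conllu_line(line: str) -> str:
    """Write the reduced MSD tag into the UPOS column of one CoNLL-U token line.

    Assumption (not confirmed by the record): XPOS (index 4) holds the full
    MSD tag and UPOS (index 3) receives its first letter. Comment lines,
    blank lines and malformed lines are returned unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line
    cols[3] = reduce_msd(cols[4])
    return "\t".join(cols) + newline


if __name__ == "__main__":
    # Hypothetical token line, used only to show the transformation.
    sample = "1\thiša\thiša\tNOUN\tSozei\t_\t0\troot\t_\t_\n"
    print(reduce_conllu_line(sample), end="")
```

Keeping the full tag in one column while matching on the reduced one is one way to obtain the behaviour the description reports, namely extraction at the part-of-speech level with full MSD tags still visible as nodes.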
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
- Name: solar3.0_skladenjske-strukture.zip
- Size: 329.88 MB
- Format: application/zip
- Description: Syntactic structures from Šolar 3.0
- MD5: 3d189355ea6785f039a541b05a70391e
- solar3.0_skladenjske-strukture
  - medstavcne
    - soredje
      - solar-orig_soredje_default_tree-description.xlsx
      - solar-orig_soredje_default.tsv
      - solar-orig_soredje_all-examples_metadata_tree-description.xlsx
      - solar-orig_soredje_all-examples_metadata_tree-description.tsv
      - solar-orig_soredje_all-examples.tsv
      - solar-orig_soredje_default_tree-description.tsv
    - podredje
      - osebkov-odv
        - solar-orig_podredje-osebkov-odv_default_tree-description.xlsx
        - solar-orig_podredje-osebkov-odv_default_tree-description.tsv
        - solar-orig_podredje-osebkov-odv_all-examples_metadata_tree-description.tsv
        - solar-orig_podredje-osebkov-odv_all-examples_metadata_tree-description.xlsx
        - solar-orig_podredje-osebkov-odv_all-examples.tsv
        - solar-orig_podredje-osebkov-odv_default.tsv
      - prislovni-odv
        - solar-orig_podredje-prislovni-odv_default_tree-description.xlsx
        - solar-orig_podredje-prislovni-odv_default.tsv
        - solar-orig_podredje-prislovni-odv_all-examples_metadata_tree-description.xlsx
        - solar-orig_podredje-prislovni-odv_default_tree-description.tsv
        - solar-orig_podredje-prislovni-odv_all-examples.tsv
        - solar-orig_podredje-prislovni-odv_all-examples_metadata_tree-description.tsv
      - prilastkov-odv
        - solar-orig_podredje-prilastkov-odv_all-examples_metadata_tree-description.tsv
        - solar-orig_podredje-prilastkov-odv_default_tree-description.tsv
        - solar-orig_podredje-prilastkov-odv_default.tsv
        - solar-orig_podredje-prilastkov-odv_all-examples_metadata_tree-description.xlsx
        - solar-orig_podredje-prilastkov-odv_all-examples.tsv
        - solar-orig_podredje-prilastkov-odv_default_tree-description.xlsx
      - predmetni-odv
        - solar-orig_podredje-predmetni-odv_default_tree-description.xlsx
        - solar-orig_podredje-predmetni-odv_default.tsv
        - solar-orig_podredje-predmetni-odv_all-examples.tsv
        - solar-orig_podredje-predmetni-odv_default_tree-description.tsv
        - solar-orig_podredje-predmetni-odv_all-examples_metadata_tree-description.tsv
        - solar-orig_podredje-predmetni-odv_all-examples_metadata_tree-description.xlsx
    - priredje
      - solar-orig_priredje_default_correct-decimal-values.tsv
      - solar-orig_priredje_default.tsv
      - solar-orig_priredje_all-examples_metadata_tree-description.xlsx
      - solar-orig_priredje_default_tree-description.tsv
      - solar-orig_priredje_all-examples.tsv
      - solar-orig_priredje_all-examples_metadata_tree-description.tsv
    - config_solar_medstavcne.ini
  - besednozvezne
    - samostalnik
      - solar-orig_samostalnik_default.tsv
      - solar-orig_samostalnik_default_tree-description.tsv
      - solar-orig_samostalnik_all-examples_metadata_tree-description.xlsx
      - solar-orig_samostalnik_default_tree-description.xlsx
      - solar-orig_samostalnik_all-examples.tsv
      - solar-orig_samostalnik_all-examples_metadata_tree-description.tsv
    - predlog
      - solar-orig_predlog_default_tree-description.tsv
      - solar-orig_predlog_all-examples.tsv
      - solar-orig_predlog_default_tree-description.xlsx
      - solar-orig_predlog_all-examples_metadata_tree-description.xlsx
      - solar-orig_predlog_all-examples_metadata_tree-description.tsv
      - solar-orig_predlog_default.tsv
    - config_solar_besednozvezne.ini
    - stevnik
      - solar-orig_stevnik_default_tree-description.tsv
      - solar-orig_stevnik_default_tree-description.xlsx
      - solar-orig_stevnik_all-examples_metadata_tree-description.xlsx
      - solar-orig_stevnik_all-examples.tsv
      - solar-orig_stevnik_all-examples_metadata_tree-description.tsv
      - solar-orig_stevnik_default.tsv
    - prislov
      - solar-orig_prislov_all-examples_metadata_tree-description.tsv
      - solar-orig_prislov_default_tree-description.xlsx
      - solar-orig_prislov_default.tsv
      - solar-orig_prislov_default_tree-description.tsv
      - solar-orig_prislov_all-examples_metadata_tree-description.xlsx
      - solar-orig_prislov_all-examples.tsv
    - veznik
      - solar-orig_veznik_default_tree-description.tsv
      - solar-orig_veznik_all-examples_metadata_tree-description.xlsx
      - solar-orig_veznik_all-examples.tsv
      - solar-orig_veznik_default_tree-description.xlsx
      - solar-orig_veznik_default.tsv
      - solar-orig_veznik_all-examples_metadata_tree-description.tsv
    - zaimek
      - solar-orig_zaimek_default_tree-description.xlsx
      - solar-orig_zaimek_all-examples_metadata_tree-description.xlsx
      - solar-orig_zaimek_default_tree-description.tsv
      - solar-orig_zaimek_all-examples_metadata_tree-description.tsv
      - solar-orig_zaimek_all-examples.tsv
      - solar-orig_zaimek_default.tsv
    - clenek
      - solar-orig_clenek_all-examples_metadata_tree-description.tsv
      - solar-orig_clenek_default_tree-description.xlsx
      - solar-orig_clenek_all-examples_metadata_tree-description.xlsx
      - solar-orig_clenek_all-examples.tsv
      - solar-orig_clenek_default.tsv
      - solar-orig_clenek_default_tree-description.tsv
    - okrajsava
      - solar-orig_okrajsava_default_tree-description.xlsx
      - solar-orig_okrajsava_all-examples_metadata_tree-description.tsv
      - solar-orig_okrajsava_default.tsv
      - solar-orig_okrajsava_all-examples.tsv
      - solar-orig_okrajsava_default_tree-description.tsv
      - solar-orig_okrajsava_all-examples_metadata_tree-description.xlsx
    - pridevnik
      - solar-orig_pridevnik_all-examples.tsv
      - solar-orig_pridevnik_default.tsv
      - solar-orig_pridevnik_default_tree-description.xlsx
      - solar-orig_pridevnik_default_tree-description.tsv
      - solar-orig_pridevnik_all-examples_metadata_tree-description.xlsx
      - solar-orig_pridevnik_all-examples_metadata_tree-description.tsv
    - glagol
      - solar-orig_glagol_all-examples.tsv
      - solar-orig_glagol_default.tsv
      - solar-orig_glagol_default_tree-description.tsv
      - solar-orig_glagol_all-examples_metadata_tree-description.xlsx
      - solar-orig_glagol_default_tree-description.xlsx
      - solar-orig_glagol_all-examples_metadata_tree-description.tsv
  - 00README.txt
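
A downloaded copy of the archive can be checked against the MD5 value given in the file details above. The following is a minimal sketch using only the Python standard library; the function names and the local file path are illustrative assumptions, not part of the package.

```python
import hashlib

# MD5 checksum as listed in this record for solar3.0_skladenjske-strukture.zip.
EXPECTED_MD5 = "3d189355ea6785f039a541b05a70391e"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so a ~330 MB archive is never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path):
    """Return True if the file at `path` matches the checksum from the record."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `archive_is_intact("solar3.0_skladenjske-strukture.zip")` would be called on the local download, assuming it was saved under that name in the current directory.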