
 
dc.contributor.author Munda, Tina
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Rozman, Tadeja
dc.contributor.author Stritar Kučuk, Mojca
dc.contributor.author Krek, Simon
dc.contributor.author Krapš Vodopivec, Irena
dc.contributor.author Stabej, Marko
dc.contributor.author Pori, Eva
dc.contributor.author Goli, Teja
dc.contributor.author Lavrič, Polona
dc.contributor.author Laskowski, Cyprian
dc.contributor.author Kocjančič, Polonca
dc.contributor.author Klemenc, Bojan
dc.contributor.author Krsnik, Luka
dc.contributor.author Kosem, Iztok
dc.date.accessioned 2025-01-31T16:06:11Z
dc.date.available 2025-01-31T16:06:11Z
dc.date.issued 2025-01-30
dc.identifier.uri http://hdl.handle.net/11356/2009
dc.description The frequency lists of syntactic structures were extracted from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), with the STARK v3 tool (http://hdl.handle.net/11356/1958). The extracted data is available at two levels: the phrase level (see folder "besednozvezne") and the sentence level (see folder "medstavcne"). At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts: noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek), numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), and abbreviation (okrajšava). No results were returned for interjection (medmet) and residual (neuvrščeno). These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former. At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are: parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD: clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisniki), and adnominal clause modifier (prilastkov odvisnik). These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).
The dataset can be used for syntactic analyses of school writing in Slovene (in Slovene schools), also in combination with comparable data (http://hdl.handle.net/11356/2010) from the Slovene textbook corpus Učbeniki 1.0, which represents the expected or desired scope of reception. For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "solar-orig_*_default.tsv" - the original output, containing the extracted unique syntactic structures of varying lengths (2 to 10 tokens), arranged by frequency, followed by additional data on the syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "solar-orig_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurrences of the extracted structures in every sentence).
- "solar-orig_*_default_tree-description.tsv" - an extension of the "solar-orig_*_default.tsv" file that includes a verbal description of the syntactic structures (trees).
- "solar-orig_*_all-examples_metadata_tree-description.tsv" - an extension of the "solar-orig_*_all-examples.tsv" file that includes school text metadata and a verbal description of the syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner: First, the corpus was linguistically annotated with the CLASSLA v2.1 pipeline (https://github.com/clarinsi/classla/) at the levels of UD part-of-speech and syntactic relations annotations to enable the extraction of sentence-level structures. Furthermore, the original corpus containing MULTEXT-East tags (MSD tags) was preprocessed to reduce each tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features).
This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.) Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema. The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data. Another step was to enhance all output files with verbal descriptions of the extracted structures and to enrich all "solar-orig_*_all-examples.tsv" files with school text metadata by assigning metadata from "solar-meta.tsv" (see "Solar.CoNLL-U.zip" in http://hdl.handle.net/11356/1589) to each structure based on matching text IDs (both steps were done with Python). Lastly, the extended versions of the two original output files ("solar-orig_*_default_tree-description.tsv", "solar-orig_*_all-examples_metadata_tree-description.tsv") were converted into Excel spreadsheets. The package also includes a configuration file for each level: "config_solar_besednozvezne.ini" for phrase-level structures, and "config_solar_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK. For more details, see "00README.txt".
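The tag-reduction step described above (e.g., Somei → S) could be sketched roughly as follows. This is only an illustration, not the authors' script: it assumes that the MSD tag sits in the XPOS column (fifth column) of a standard 10-column CoNLL-U token line, and the function names `reduce_msd` and `reduce_conllu_line` are hypothetical. How the full MSD tags were retained for display in the extracted structures is not stated in the description and is not covered here.

```python
def reduce_msd(tag: str) -> str:
    """Return the first letter of an MSD tag, which denotes the part of speech."""
    return tag[:1]


def reduce_conllu_line(line: str, xpos_col: int = 4) -> str:
    """Shorten the MSD tag in one CoNLL-U token line.

    Assumption: the MSD tag is stored in the XPOS column (index 4).
    Comment lines, blank lines and non-standard lines pass through unchanged.
    """
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line  # not a standard 10-column token line
    if cols[xpos_col] != "_":
        cols[xpos_col] = reduce_msd(cols[xpos_col])
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline
```

Applied line by line to a copy of the corpus file, this would leave every column except the assumed tag column untouched.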
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Faculty of Arts, University of Ljubljana
dc.relation.isreferencedby https://zenodo.org/records/13936442
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/prop/en/
dc.subject developmental corpus
dc.subject student writing
dc.subject syntactic data
dc.subject syntactic structures
dc.title Frequency lists of syntactic structures from the Šolar 3.0 corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tina Munda tina.munda@cjvt.si CJVT UL
sponsor ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 1850324 entries
files.count 1
files.size 345901438


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name solar3.0_skladenjske-strukture.zip
Size 329.88 MB
Format application/zip
Description Syntactic structures from Šolar 3.0
MD5 3d189355ea6785f039a541b05a70391e
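The MD5 value listed above can be used to check that a downloaded copy of the archive is intact. A minimal sketch using only the Python standard library follows; the helper name `md5_of_file` and the local path in the usage comment are assumptions for illustration.

```python
import hashlib

# Checksum as listed in this record for solar3.0_skladenjske-strukture.zip
EXPECTED_MD5 = "3d189355ea6785f039a541b05a70391e"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (the path is hypothetical and depends on where the file was saved):
# ok = md5_of_file("solar3.0_skladenjske-strukture.zip") == EXPECTED_MD5
```

Reading in chunks keeps memory use low for an archive of this size (about 330 MB).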
 File Preview  
  • solar3.0_skladenjske-strukture
    • medstavcne
      • soredje
        • solar-orig_soredje_default_tree-description.xlsx
        • solar-orig_soredje_default.tsv
        • solar-orig_soredje_all-examples_metadata_tree-description.xlsx
        • solar-orig_soredje_all-examples_metadata_tree-description.tsv
        • solar-orig_soredje_all-examples.tsv
        • solar-orig_soredje_default_tree-description.tsv
      • podredje
        • osebkov-odv
          • solar-orig_podredje-osebkov-odv_default_tree-description.xlsx
          • solar-orig_podredje-osebkov-odv_default_tree-description.tsv
          • solar-orig_podredje-osebkov-odv_all-examples_metadata_tree-description.tsv
          • solar-orig_podredje-osebkov-odv_all-examples_metadata_tree-description.xlsx
          • solar-orig_podredje-osebkov-odv_all-examples.tsv
          • solar-orig_podredje-osebkov-odv_default.tsv
        • prislovni-odv
          • solar-orig_podredje-prislovni-odv_default_tree-description.xlsx
          • solar-orig_podredje-prislovni-odv_default.tsv
          • solar-orig_podredje-prislovni-odv_all-examples_metadata_tree-description.xlsx
          • solar-orig_podredje-prislovni-odv_default_tree-description.tsv
          • solar-orig_podredje-prislovni-odv_all-examples.tsv
          • solar-orig_podredje-prislovni-odv_all-examples_metadata_tree-description.tsv
        • prilastkov-odv
          • solar-orig_podredje-prilastkov-odv_all-examples_metadata_tree-description.tsv
          • solar-orig_podredje-prilastkov-odv_default_tree-description.tsv
          • solar-orig_podredje-prilastkov-odv_default.tsv
          • solar-orig_podredje-prilastkov-odv_all-examples_metadata_tree-description.xlsx
          • solar-orig_podredje-prilastkov-odv_all-examples.tsv
          • solar-orig_podredje-prilastkov-odv_default_tree-description.xlsx
        • predmetni-odv
          • solar-orig_podredje-predmetni-odv_default_tree-description.xlsx
          • solar-orig_podredje-predmetni-odv_default.tsv
          • solar-orig_podredje-predmetni-odv_all-examples.tsv
          • solar-orig_podredje-predmetni-odv_default_tree-description.tsv
          • solar-orig_podredje-predmetni-odv_all-examples_metadata_tree-description.tsv
          • solar-orig_podredje-predmetni-odv_all-examples_metadata_tree-description.xlsx
      • priredje
        • solar-orig_priredje_default_correct-decimal-values.tsv
        • solar-orig_priredje_default.tsv
        • solar-orig_priredje_all-examples_metadata_tree-description.xlsx
        • solar-orig_priredje_default_tree-description.tsv
        • solar-orig_priredje_all-examples.tsv
        • solar-orig_priredje_all-examples_metadata_tree-description.tsv
      • config_solar_medstavcne.ini
    • besednozvezne
      • samostalnik
        • solar-orig_samostalnik_default.tsv
        • solar-orig_samostalnik_default_tree-description.tsv
        • solar-orig_samostalnik_all-examples_metadata_tree-description.xlsx
        • solar-orig_samostalnik_default_tree-description.xlsx
        • solar-orig_samostalnik_all-examples.tsv
        • solar-orig_samostalnik_all-examples_metadata_tree-description.tsv
      • predlog
        • solar-orig_predlog_default_tree-description.tsv
        • solar-orig_predlog_all-examples.tsv
        • solar-orig_predlog_default_tree-description.xlsx
        • solar-orig_predlog_all-examples_metadata_tree-description.xlsx
        • solar-orig_predlog_all-examples_metadata_tree-description.tsv
        • solar-orig_predlog_default.tsv
      • config_solar_besednozvezne.ini
      • stevnik
        • solar-orig_stevnik_default_tree-description.tsv
        • solar-orig_stevnik_default_tree-description.xlsx
        • solar-orig_stevnik_all-examples_metadata_tree-description.xlsx
        • solar-orig_stevnik_all-examples.tsv
        • solar-orig_stevnik_all-examples_metadata_tree-description.tsv
        • solar-orig_stevnik_default.tsv
      • prislov
        • solar-orig_prislov_all-examples_metadata_tree-description.tsv
        • solar-orig_prislov_default_tree-description.xlsx
        • solar-orig_prislov_default.tsv
        • solar-orig_prislov_default_tree-description.tsv
        • solar-orig_prislov_all-examples_metadata_tree-description.xlsx
        • solar-orig_prislov_all-examples.tsv
      • veznik
        • solar-orig_veznik_default_tree-description.tsv
        • solar-orig_veznik_all-examples_metadata_tree-description.xlsx
        • solar-orig_veznik_all-examples.tsv
        • solar-orig_veznik_default_tree-description.xlsx
        • solar-orig_veznik_default.tsv
        • solar-orig_veznik_all-examples_metadata_tree-description.tsv
      • zaimek
        • solar-orig_zaimek_default_tree-description.xlsx
        • solar-orig_zaimek_all-examples_metadata_tree-description.xlsx
        • solar-orig_zaimek_default_tree-description.tsv
        • solar-orig_zaimek_all-examples_metadata_tree-description.tsv
        • solar-orig_zaimek_all-examples.tsv
        • solar-orig_zaimek_default.tsv
      • clenek
        • solar-orig_clenek_all-examples_metadata_tree-description.tsv
        • solar-orig_clenek_default_tree-description.xlsx
        • solar-orig_clenek_all-examples_metadata_tree-description.xlsx
        • solar-orig_clenek_all-examples.tsv
        • solar-orig_clenek_default.tsv
        • solar-orig_clenek_default_tree-description.tsv
      • okrajsava
        • solar-orig_okrajsava_default_tree-description.xlsx
        • solar-orig_okrajsava_all-examples_metadata_tree-description.tsv
        • solar-orig_okrajsava_default.tsv
        • solar-orig_okrajsava_all-examples.tsv
        • solar-orig_okrajsava_default_tree-description.tsv
        • solar-orig_okrajsava_all-examples_metadata_tree-description.xlsx
      • pridevnik
        • solar-orig_pridevnik_all-examples.tsv
        • solar-orig_pridevnik_default.tsv
        • solar-orig_pridevnik_default_tree-description.xlsx
        • solar-orig_pridevnik_default_tree-description.tsv
        • solar-orig_pridevnik_all-examples_metadata_tree-description.xlsx
        • solar-orig_pridevnik_all-examples_metadata_tree-description.tsv
      • glagol
        • solar-orig_glagol_all-examples.tsv
        • solar-orig_glagol_default.tsv
        • solar-orig_glagol_default_tree-description.tsv
        • solar-orig_glagol_all-examples_metadata_tree-description.xlsx
        • solar-orig_glagol_default_tree-description.xlsx
        • solar-orig_glagol_all-examples_metadata_tree-description.tsv
    • 00README.txt
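
The file naming pattern shown in the listing ("solar-orig_*_<suffix>.tsv", with the asterisk standing for a part of speech or clausal relation) can be used to locate and load the four TSV files of a category. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the TSV files are UTF-8 encoded and start with a header row, which should be checked against "00README.txt".

```python
import csv

# The four TSV variants provided for each category, as named in the record.
SUFFIXES = (
    "default.tsv",
    "all-examples.tsv",
    "default_tree-description.tsv",
    "all-examples_metadata_tree-description.tsv",
)


def expected_tsv_names(category: str) -> list[str]:
    """Build the four TSV file names for a category such as 'glagol'."""
    return [f"solar-orig_{category}_{suffix}" for suffix in SUFFIXES]


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by column name.

    Assumptions: UTF-8 encoding and a header row in the first line.
    """
    with open(path, encoding="utf-8", newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

For example, `expected_tsv_names("podredje-osebkov-odv")` yields the names found in the "medstavcne/podredje/osebkov-odv" folder of the listing.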
