dc.contributor.author | Munda, Tina |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Rozman, Tadeja |
dc.contributor.author | Stritar Kučuk, Mojca |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Krapš Vodopivec, Irena |
dc.contributor.author | Stabej, Marko |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Goli, Teja |
dc.contributor.author | Lavrič, Polona |
dc.contributor.author | Laskowski, Cyprian |
dc.contributor.author | Kocjančič, Polonca |
dc.contributor.author | Klemenc, Bojan |
dc.contributor.author | Krsnik, Luka |
dc.contributor.author | Kosem, Iztok |
dc.date.accessioned | 2025-02-01T06:48:05Z |
dc.date.available | 2025-02-01T06:48:05Z |
dc.date.issued | 2025-01-31 |
dc.identifier.uri | http://hdl.handle.net/11356/2011 |
dc.description | The frequency list of collocations from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) and uses the MULTEXT-East morphosyntactic annotations (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) together with the JOS-SYN dependency parsing annotations (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serve as a syntactic complement to the former. The formal description of the syntactic structures is included in the CORDEX library (see "structures_JOS.xml").
There are 3 output files:
- "solar-orig3.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to 81 (out of 82) predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper_metadata.csv" is an extension of the "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" file that includes school-text metadata.
The dataset can be used for analyses of Slovene-language writing in Slovene schools, especially in combination with comparable data (http://hdl.handle.net/11356/2012) from the Slovene textbook corpus Učbeniki 1.0, which represents the expected or desired scope of reception, to identify core student vocabulary.
The data were prepared as follows. In the preprocessing phase, the MULTEXT-East morphosyntactic tags (MSD tags) in the CoNLL-U input corpus were converted from Slovene to their English equivalents because the library then in use did not support Slovene MSD tags. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were converted back to the original Slovene MSD tags. For more details, see "00README.txt".
---
KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. In: ARHAR HOLDT, Špela (ed.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1st ed. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Pp. 160-194, illus. Series Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/7320 |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/en/ |
dc.subject | developmental corpus |
dc.subject | student writing |
dc.subject | collocation data |
dc.subject | collocations |
dc.subject | syntactic structures |
dc.title | Frequency list of collocations from the Šolar 3.0 corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tina Munda tina.munda@cjvt.si CJVT UL |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 256312 entries |
files.count | 1 |
files.size | 22502902 |
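The description names several columns of "solar-orig3.0_kolokacije.csv", including Joint_representative_form and LogDice_all. As a minimal sketch of how such a list might be filtered by association score, the Python snippet below reads rows with the standard csv module and keeps those at or above a threshold. The comma delimiter, the helper name top_collocations, the threshold value, and the sample rows are all assumptions made for illustration; the actual file layout is documented in "00README.txt".

```python
import csv
import io


def top_collocations(csv_file, min_logdice=7.0, delimiter=","):
    """Return (form, LogDice_all) pairs scoring at least min_logdice, highest first.

    Hypothetical helper: the delimiter and the threshold are assumptions,
    not values documented for the dataset.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    rows = []
    for row in reader:
        try:
            score = float(row["LogDice_all"])
        except (KeyError, ValueError, TypeError):
            # Skip rows where the score is missing or not numeric.
            continue
        if score >= min_logdice:
            rows.append((row["Joint_representative_form"], score))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows


# Invented toy rows for illustration only; these are NOT taken from the dataset.
SAMPLE = (
    "Joint_representative_form,LogDice_all\n"
    "glavna oseba,9.5\n"
    "lep dan,6.2\n"
    "prosti čas,10.1\n"
)

result = top_collocations(io.StringIO(SAMPLE))
print(result)
```

With the real file, the in-memory sample would be replaced by a file handle opened with `encoding="utf-8"` and `newline=""`, and the delimiter adjusted to match the file.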
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

- Name: solar3.0_kolokacije.zip
- Size: 21.46 MB
- Format: application/zip
- Description: Collocations from Šolar 3.0
- MD5: 576fe5780e1516c22f86dd5660ea7cf1
- solar3.0_kolokacije
  - solar-orig3.0_kolokacije_collocation_sentence_mapper.csv
  - solar-orig3.0_kolokacije.csv
  - solar-orig3.0_kolokacije_collocation_sentence_mapper_metadata.csv
  - 00README.txt
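
The MD5 value listed above can be used to check that a downloaded copy of "solar3.0_kolokacije.zip" is complete. The following is a minimal sketch using Python's standard hashlib module. The function names md5_of_file and archive_is_intact are hypothetical, and the local path passed in is assumed to be wherever the archive was saved.

```python
import hashlib

# MD5 checksum as listed for solar3.0_kolokacije.zip in this record.
EXPECTED_MD5 = "576fe5780e1516c22f86dd5660ea7cf1"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("solar3.0_kolokacije.zip")` returns True only when the local file's checksum matches the listed value.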