dc.contributor.author | Munda, Tina |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2025-02-01T06:48:33Z |
dc.date.available | 2025-02-01T06:48:33Z |
dc.date.issued | 2025-01-31 |
dc.identifier.uri | http://hdl.handle.net/11356/2012 |
dc.description | The frequency list of collocations from the Slovene textbook corpus Učbeniki 1.0 was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of the syntactic structures is included in the CORDEX library (see "structures_JOS.xml"). There are two output files: - "ucbeniki1.0_kolokacije.csv" contains the original output of collocations with an absolute frequency of 1 or more, corresponding to the 82 predefined syntactic structures. The list is sorted by the absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows, for each collocation, the number of distinct forms in which its lemmas appear in the corpus. - "ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation. The dataset can be used for analyses, especially in combination with comparable data (http://hdl.handle.net/11356/2011) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589) to identify core student vocabulary. The data were prepared as follows: In the preprocessing phase, all individual Slovene school textbooks were merged into a single CoNLL-U file.
Because the library then in use did not support Slovene MULTEXT-East morphosyntactic tags (MSD tags), these tags were converted into their English equivalents. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into Slovene MSD tags. For more details, see "00README.txt". --- KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. V: ARHAR HOLDT, Špela (ur.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1. izd. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Str. 160-194, ilustr. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/7320 |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Faculty of Arts, University of Ljubljana |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/en/ |
dc.subject | textbook corpus |
dc.subject | pedagogic corpus |
dc.subject | student reading |
dc.subject | collocation data |
dc.subject | collocations |
dc.subject | syntactic structures |
dc.title | Frequency list of collocations from the Učbeniki 1.0 corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tina Munda tina.munda@cjvt.si CJVT UL |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 701943 entries |
files.count | 1 |
files.size | 45268159 |
Files in this item
- Name
- ucbeniki1.0_kolokacije.zip
- Size
- 43.17 MB
- Format
- application/zip
- Description
- Collocations from Učbeniki 1.0
- MD5
- bf0441e3c594c7d59014a1d8ce6ac592
- ucbeniki1.0_kolokacije
- ucbeniki1.0_kolokacije.csv
- ucbeniki1.0_kolokacije_collocations_sentence_mapper.csv
- 00README.txt
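
The description states that each row of the sentence mapper carries a collocation ID matching the first file. A minimal Python sketch of how the two files might be linked on that ID is given below. Only Joint_representative_form and LogDice_all are column names taken from this record; Collocation_ID, Frequency, Sentence_ID and Tokens, the comma delimiter, and all sample rows are invented for illustration and should be checked against "00README.txt" and the actual CSV headers.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: tiny invented stand-ins for the two CSV files. The real
# column names, delimiter and values are not given in this record.
COLLOCATIONS_CSV = """Collocation_ID,Joint_representative_form,Frequency,LogDice_all
1,rešiti nalogo,3,9.1
2,zelena rastlina,1,7.4
"""

MAPPER_CSV = """Collocation_ID,Sentence_ID,Tokens
1,s1,rešiti nalogo
1,s7,reši nalogo
1,s9,rešili naloge
2,s4,zelene rastline
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def group_occurrences(mapper_rows, id_column="Collocation_ID"):
    """Group sentence-mapper rows by collocation ID (hypothetical column name)."""
    occurrences = defaultdict(list)
    for row in mapper_rows:
        occurrences[row[id_column]].append(row)
    return occurrences


def attach_occurrences(collocation_rows, occurrences, id_column="Collocation_ID"):
    """Return collocation rows, each extended with its list of corpus occurrences."""
    return [
        {**row, "occurrences": occurrences.get(row[id_column], [])}
        for row in collocation_rows
    ]


collocations = load_rows(COLLOCATIONS_CSV)
occurrences = group_occurrences(load_rows(MAPPER_CSV))
joined = attach_occurrences(collocations, occurrences)

for entry in joined:
    sentence_ids = [occ["Sentence_ID"] for occ in entry["occurrences"]]
    print(entry["Joint_representative_form"], sentence_ids)
```

For the real files, the inline strings would be replaced by the contents of the two CSV files read from disk. Grouping the mapper rows in a dictionary first keeps the join to a single pass over each file, which matters with several hundred thousand entries.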