dc.contributor.author Munda, Tina
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Kosem, Iztok
dc.contributor.author Pori, Eva
dc.contributor.author Krek, Simon
dc.date.accessioned 2025-02-01T06:48:33Z
dc.date.available 2025-02-01T06:48:33Z
dc.date.issued 2025-01-31
dc.identifier.uri http://hdl.handle.net/11356/2012
dc.description The frequency list of collocations from the Slovene textbook corpus Učbeniki 1.0 was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) and uses the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of the syntactic structures is included in the CORDEX library (see "structures_JOS.xml").

There are 2 output files:
- "ucbeniki1.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to the 82 predefined syntactic structures. The list is sorted by the absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation.
- "ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation.

The dataset can be used for analyses, especially in combination with comparable data (http://hdl.handle.net/11356/2011) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589) to identify core student vocabulary.

The data was prepared in the following manner: In the preprocessing phase, all individual Slovene school textbooks were merged into a single CoNLL-U file. Because the library then in use did not support Slovene MULTEXT-East morphosyntactic tags (MSD tags), these tags were converted into their English equivalents. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into Slovene MSD tags. For more details, see "00README.txt".

---
KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. V: ARHAR HOLDT, Špela (ur.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1. izd. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Str. 160-194, ilustr. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/7320
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Faculty of Arts, University of Ljubljana
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://www.cjvt.si/prop/en/
dc.subject textbook corpus
dc.subject pedagogic corpus
dc.subject student reading
dc.subject collocation data
dc.subject collocations
dc.subject syntactic structures
dc.title Frequency list of collocations from the Učbeniki 1.0 corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tina Munda tina.munda@cjvt.si CJVT UL
sponsor ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 701943 entries
files.count 1
files.size 45268159
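
The description says that each row of the sentence mapper carries a collocation ID matching the main collocation list. The following is a minimal sketch of how the two files might be combined on that ID, using only the Python standard library. It rests on several assumptions: the record does not give the CSV delimiter or the full header, so a comma delimiter is assumed, and the column names Collocation_ID, Frequency, Sentence_ID and Tokens are hypothetical (only Joint_representative_form and LogDice_all are named in the description). The sample rows are made up for illustration and are not taken from the dataset; consult "00README.txt" for the actual layout.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the two CSV files. The rows are invented for
# illustration only. Apart from Joint_representative_form and LogDice_all,
# the column names are hypothetical, and the comma delimiter is an assumption.
COLLOCATIONS_CSV = """Collocation_ID,Joint_representative_form,Frequency,LogDice_all
1,naravno število,120,9.8
2,glavno mesto,95,10.4
"""

MAPPER_CSV = """Collocation_ID,Sentence_ID,Tokens
1,s1,naravna števila
1,s7,naravnih števil
2,s3,glavno mesto
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def occurrences_by_collocation(mapper_rows):
    """Group (sentence ID, tokens) pairs by collocation ID."""
    grouped = defaultdict(list)
    for row in mapper_rows:
        grouped[row["Collocation_ID"]].append((row["Sentence_ID"], row["Tokens"]))
    return grouped


def attach_occurrences(collocation_rows, mapper_rows):
    """Return one record per collocation with its corpus occurrences attached."""
    grouped = occurrences_by_collocation(mapper_rows)
    return [
        {
            "form": row["Joint_representative_form"],
            "frequency": int(row["Frequency"]),
            "occurrences": grouped.get(row["Collocation_ID"], []),
        }
        for row in collocation_rows
    ]


joined = attach_occurrences(load_rows(COLLOCATIONS_CSV), load_rows(MAPPER_CSV))
for entry in joined:
    print(entry["form"], entry["frequency"], len(entry["occurrences"]))
```

For the real files, `load_rows` would be swapped for a reader over the extracted CSVs opened with `encoding="utf-8"`; with 701943 entries, streaming the mapper file row by row may be preferable to loading it whole.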


 Files in this item

Name ucbeniki1.0_kolokacije.zip
Size 43.17 MB
Format application/zip
Description Collocations from Učbeniki 1.0
MD5 bf0441e3c594c7d59014a1d8ce6ac592

File preview
  • ucbeniki1.0_kolokacije
    • ucbeniki1.0_kolokacije.csv
    • ucbeniki1.0_kolokacije_collocations_sentence_mapper.csv
    • 00README.txt
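
The record lists an MD5 checksum for the archive, which can be used to confirm that a download is intact. A minimal sketch with Python's standard `hashlib` module follows; the helper names `md5_of_file` and `matches_checksum` are illustrative, not part of any repository tooling, and the local file path is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Assuming the archive was saved to the current directory, `matches_checksum("ucbeniki1.0_kolokacije.zip", "bf0441e3c594c7d59014a1d8ce6ac592")` should return `True` for an uncorrupted download.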
