Frequency list of collocations from the Učbeniki 1.0 corpus

Name: Frequency list of collocations from the Učbeniki 1.0 corpus
License: https://creativecommons.org/licenses/by-sa/4.0/

Munda, Tina; Arhar Holdt, Špela; Kosem, Iztok; Pori, Eva; Krek, Simon


dc.contributor.author	Munda, Tina
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Kosem, Iztok
dc.contributor.author	Pori, Eva
dc.contributor.author	Krek, Simon
dc.date.accessioned	2025-02-01T06:48:33Z
dc.date.available	2025-02-01T06:48:33Z
dc.date.issued	2025-01-31
dc.identifier.uri	http://hdl.handle.net/11356/2012
dc.description	The frequency list of collocations from the Slovene textbook corpus Učbeniki 1.0 was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of syntactic structures is included in the CORDEX library (see "structures_JOS.xml"). There are two output files: - "ucbeniki1.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to 82 predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation. - "ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation. The dataset can be used for analyses, especially in combination with comparable data (http://hdl.handle.net/11356/2011) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589) to identify core student vocabulary. The data was prepared in the following manner: In the preprocessing phase, all individual Slovene school textbooks were merged into a single CoNLL-U file.
Because the library then in use did not support Slovene MULTEXT-East morphosyntactic tags (MSD tags), these tags were converted into their English equivalents. Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into Slovene MSD tags. For more details, see "00README.txt". --- KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. V: ARHAR HOLDT, Špela (ur.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1. izd. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Str. 160-194, ilustr. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/7320
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher	Faculty of Arts, University of Ljubljana
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/prop/en/
dc.subject	textbook corpus
dc.subject	pedagogic corpus
dc.subject	student reading
dc.subject	collocation data
dc.subject	collocations
dc.subject	syntactic structures
dc.title	Frequency list of collocations from the Učbeniki 1.0 corpus
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Tina Munda tina.munda@cjvt.si CJVT UL
sponsor	ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	701943 entries
files.count	1
files.size	45268159
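
The two output files described in dc.description can be linked through the collocation ID. The following is only a minimal sketch of how that might look in Python, not part of the dataset or of CORDEX. The measure name LogDice_all and the column Joint_representative_form come from the description above; the column names Collocation_id and Sentence_id, the comma delimiter, and the toy rows are assumptions made for illustration, so check "00README.txt" for the actual header names before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the two CSV files. The header names Collocation_id and
# Sentence_id are hypothetical; the real headers may differ.
SAMPLE_COLLOCATIONS = (
    "Collocation_id,Joint_representative_form,LogDice_all\n"
    "1,toy example a,7.5\n"
    "2,toy example b,9.1\n"
)
SAMPLE_MAPPER = (
    "Collocation_id,Sentence_id\n"
    "1,s1\n"
    "2,s2\n"
    "2,s3\n"
)


def load_rows(handle, delimiter=","):
    """Read an open CSV file object into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def top_collocations(rows, score_col="LogDice_all", n=10):
    """Return the n rows with the highest numeric value in score_col."""
    return sorted(rows, key=lambda r: float(r[score_col]), reverse=True)[:n]


def sentences_by_collocation(mapper_rows, id_col="Collocation_id",
                             sent_col="Sentence_id"):
    """Group sentence identifiers by collocation ID (assumed column names)."""
    index = defaultdict(list)
    for row in mapper_rows:
        index[row[id_col]].append(row[sent_col])
    return dict(index)


collocations = load_rows(io.StringIO(SAMPLE_COLLOCATIONS))
mapper = load_rows(io.StringIO(SAMPLE_MAPPER))
best = top_collocations(collocations, n=1)
occurrences = sentences_by_collocation(mapper)
```

To run this against the real data, replace the io.StringIO objects with files opened via open(path, newline="", encoding="utf-8") and adjust the column names and delimiter to match the actual headers.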