Response date: 2026-04-30T23:48:09Z
Request: http://www.clarin.si/repository/oai/request

Identifier: oai:www.clarin.si:11356/1271
Datestamp: 2023-03-27T17:01:19Z
Sets: hdl_11356_1023, hdl_11356_1024

Title: Frequency lists of word-level n-grams from the GOS 1.0 corpus
Authors: Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon
Keywords: n-grams; words; word forms; normalized forms; spoken corpus; word sets; morphosyntactic tags
Description: Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software.
Date: 2019-11-18
Type: lexicalConceptualResource
Handle: http://hdl.handle.net/11356/1271
Language: slv
Related resource: http://hdl.handle.net/11356/1365
Rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
Formats: application/zip; application/zip; application/zip; text/plain; charset=utf-8
downloadable_files_count: 3
Institutions: Centre for Language Resources and Technologies, University of Ljubljana; Jožef Stefan Institute
URL: http://slovnica.ijs.si/
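
To make the description's terminology concrete, the following is a minimal Python sketch of how word-level n-grams can be counted and how some of the listed collocation measures are commonly defined for 2-grams. It is not the LIST tool's implementation: the function names, the toy token sequence, and the choice of formulas (the widely used textbook definitions of Dice, logDice, MI, MI3 and t-score) are assumptions made for illustration. LIST may compute these differently, especially for 3-, 4- and 5-grams, and simple LL is left out here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word-level n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_measures(f_xy, f_x, f_y, n_tokens):
    """Common textbook association measures for a 2-gram (assumed formulas).

    f_xy     -- frequency of the 2-gram
    f_x, f_y -- frequencies of its first and second word
    n_tokens -- corpus size in tokens
    """
    expected = f_x * f_y / n_tokens          # co-occurrences expected by chance
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "logdice": 14 + math.log2(dice),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
    }


# Hypothetical toy input: lower-cased word forms, not taken from GOS 1.0.
tokens = "to je to kar je to".split()
unigrams = Counter(tokens)
bigrams = Counter(ngrams(tokens, 2))
total_bigrams = sum(bigrams.values())

for (x, y), freq in bigrams.most_common():
    relative = freq / total_bigrams          # relative frequency among 2-grams
    scores = bigram_measures(freq, unigrams[x], unigrams[y], len(tokens))
    print(x, y, freq, round(relative, 3), round(scores["logdice"], 2))
```

The same `ngrams` helper works for n = 3, 4 or 5; only the association measures above are specific to pairs of words.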