dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2018-08-03T18:46:37Z
dc.date.available 2018-08-03T18:46:37Z
dc.date.issued 2018-08-03
dc.identifier.uri http://hdl.handle.net/11356/1195
dc.description A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word n-grams of length 1 to 5:
- extensive frequency lists of all extracted n-grams
- filtered frequency lists of n-grams with minimum frequency 10/mil.
- adjusted frequency list of all n-grams with minimum frequency 10/mil.
Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted list, only n-grams occurring in at least 2 different texts have been extracted.
Key references:
- K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018.
- D. Verdonik, I. Kosem, A. Zwitter Vitez, S. Krek, M. Stabej, 2013. Compilation, transcription and usage of a reference speech corpus: The case of the Slovene corpus GOS. Language resources and evaluation, 47 (4), pp. 1031-1048, doi: 10.1007/s10579-013-9216-5.
- M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, 135–169.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Dobrovoljc-K_Frekvencni-seznami-n-gramov-v-korpusih-slovenskega-jezika.pdf
dc.relation.replaces http://hdl.handle.net/11356/1046
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject n-grams
dc.subject wordlist
dc.subject multiword expressions
dc.subject spoken corpus
dc.title Gos corpus n-grams 2.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) MR-36491 Young Researcher Programme nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
size.info 62710 unigrams
size.info 394416 bigrams
size.info 692260 trigrams
size.info 750559 4-grams
size.info 698208 5-grams
size.info 2598153 n-grams
files.count 3
files.size 22035855
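
The counting and filtering rules stated in dc.description (lowercased words, n-grams counted only within sentences, punctuation ignored, minimum frequency of 10 per million, at least 2 different texts) can be illustrated with a minimal sketch. This is not the author's extraction pipeline. The function names, the input layout (a dict of text id to tokenized sentences), the punctuation test, and the choice to normalize by the total number of word tokens are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import string


def count_ngrams(texts, n):
    """Count word n-grams of length n within sentences.

    texts: dict mapping a text id to a list of sentences, each sentence
    being a list of tokens (hypothetical input layout).
    Returns (frequency counter, texts-per-n-gram map, total word tokens).
    """
    freq = Counter()
    seen_in = defaultdict(set)
    total_tokens = 0
    for text_id, sentences in texts.items():
        for sentence in sentences:
            # Lowercase and drop tokens made up only of punctuation
            # (assumed reading of "ignoring punctuation").
            words = [
                tok.lower()
                for tok in sentence
                if not all(ch in string.punctuation for ch in tok)
            ]
            total_tokens += len(words)
            # Windows never cross a sentence boundary.
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                freq[gram] += 1
                seen_in[gram].add(text_id)
    return freq, seen_in, total_tokens


def filter_ngrams(freq, seen_in, total_tokens,
                  min_per_million=10.0, min_texts=2):
    """Keep n-grams meeting both thresholds (assumed normalization:
    occurrences per million word tokens)."""
    kept = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if per_million >= min_per_million and len(seen_in[gram]) >= min_texts:
            kept[gram] = count
    return kept


if __name__ == "__main__":
    sample = {
        "t1": [["To", "je", "to", "."]],
        "t2": [["To", "je", ",", "res"]],
    }
    freq, seen_in, total = count_ngrams(sample, 2)
    print(filter_ngrams(freq, seen_in, total))
```

In this toy sample only the bigram "to je" occurs in both texts, so it is the only one that survives the minimum-texts filter.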


 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name: all_1-5-grams_Gos.zip
Size: 20.83 MB
Format: application/zip
Description: Collection of all n-grams.
MD5: 48e9597d1a39f29e2c65f38ef236c41a
Contents:
    • gos_lc_c-no_n-3_t-1_x-1.txt (13 MB)
    • gos_lc_c-no_n-5_t-1_x-1.txt (19 MB)
    • gos_lc_c-no_n-2_t-1_x-1.txt (6 MB)
    • gos_lc_c-no_n-4_t-1_x-1.txt (17 MB)
    • gos_lc_c-no_n-1_t-1_x-1.txt (802 kB)

Name: filtered_1-5-grams_Gos.zip
Size: 101.68 KB
Format: application/zip
Description: Collection of n-grams above frequency 10/mil.
MD5: b000a8afda2b5181d6c6bfbc0ff55206
Contents:
    • gos_lc_c-no_n-5_t-10_x-2.txt (1 kB)
    • gos_lc_c-no_n-4_t-10_x-2.txt (5 kB)
    • gos_lc_c-no_n-3_t-10_x-2.txt (54 kB)
    • gos_lc_c-no_n-2_t-10_x-2.txt (135 kB)
    • gos_lc_c-no_n-1_t-10_x-2.txt (87 kB)

Name: adjusted_1-5-grams_Gos.zip
Size: 91.73 KB
Format: application/zip
Description: List of all n-grams with adjusted frequency above 10/mil.
MD5: 2c5fe866fde6ff1cbc3cb76255d93039
Contents:
    • AFL_gos_lc_c-no_n-5_t-10_x-2.txt (310 kB)
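
The MD5 values listed for the three archives can be used to check a download for corruption. The sketch below uses Python's standard `hashlib` module. The helper names `md5_of_file` and `verify_download`, and the assumption that the archives sit in the current directory under their listed names, are illustrative only.

```python
import hashlib

# Checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "all_1-5-grams_Gos.zip": "48e9597d1a39f29e2c65f38ef236c41a",
    "filtered_1-5-grams_Gos.zip": "b000a8afda2b5181d6c6bfbc0ff55206",
    "adjusted_1-5-grams_Gos.zip": "2c5fe866fde6ff1cbc3cb76255d93039",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_hex.lower()


if __name__ == "__main__":
    import os

    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "OK" if verify_download(name, expected) else "MISMATCH"
        else:
            status = "not downloaded"
        print(f"{name}: {status}")
```

MD5 is adequate here for detecting accidental corruption during transfer. It is not a safeguard against deliberate tampering.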
