
 
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2015-07-23T07:41:49Z
dc.date.available 2015-07-23T07:41:49Z
dc.date.issued 2015-07-01
dc.identifier.uri http://hdl.handle.net/11356/1045
dc.description This is a collection of n-grams extracted from the KRES corpus of written Slovene. In addition to the separate lists of n-grams for tokens and their attributes (morphosyntactic tag, lemma), it includes an adjusted frequency list produced with statistical substring reduction (as described in O'Donnell 2011). Only n-grams within sentences were counted.
dc.language.iso slv
dc.publisher Trojina, Institute for Applied Slovene Studies
dc.publisher Faculty of Arts, University of Ljubljana
dc.relation.isreplacedby http://hdl.handle.net/11356/1193
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri http://eng.slovenscina.eu/korpusi/kres
dc.subject n-grams
dc.subject wordlist
dc.subject multiword expressions
dc.title KRES corpus n-grams 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@gmail.com Trojina, Institute for Applied Slovene Studies
sponsor ARRS (Slovenian Research Agency) MR-36491 Young Researcher Programme nationalFunds
size.info 193 MB
size.info 8011997 n-grams
files.count 4
files.size 66031296
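
The counting rules stated in dc.description and in the file descriptions (n-grams of length 1 to 5, counted only within sentences, optionally lowercased, kept only at or above a minimum frequency) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the resource: the function name `count_ngrams`, its parameters, and the toy sentences are hypothetical, and the input is assumed to be already tokenized with punctuation removed.

```python
from collections import Counter


def count_ngrams(sentences, max_n=5, min_freq=10, lowercase=False):
    """Count 1- to max_n-grams without crossing sentence boundaries.

    `sentences` is an iterable of token lists (hypothetical input format).
    Only n-grams with frequency >= min_freq are returned.
    """
    counts = Counter()
    for tokens in sentences:
        if lowercase:
            tokens = [token.lower() for token in tokens]
        for n in range(1, max_n + 1):
            # Slide a window of width n over this sentence only.
            for start in range(len(tokens) - n + 1):
                counts[tuple(tokens[start:start + n])] += 1
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


# Toy data (invented for illustration, not taken from KRES).
sentences = [["To", "je", "test"], ["to", "je", "res"]]
result = count_ngrams(sentences, max_n=2, min_freq=2, lowercase=True)
```

Because each sentence is processed separately, the bigram spanning the two toy sentences ("test", "to") is never counted, which mirrors the "only n-grams within sentences" rule.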


 Files in this item

 Total size of all files in item: 62.97 MB

Name: kres_ngrams_word_1-5.zip
Size: 12.44 MB
Format: application/zip
Description: 1- to 5-grams of words excluding punctuation. The minimum frequency threshold is 10.
MD5: a7af448ffb1f4dc99f22a8534f587ce6
File preview:
    • sorted_cut-10_kres_word_c-no_n-5_t-1.txt (1 MB)
    • sorted_cut-10_kres_word_c-no_n-4_t-1.txt (3 MB)
    • sorted_cut-10_kres_word_c-no_n-3_t-1.txt (9 MB)
    • sorted_cut-10_kres_word_c-no_n-1_t-1.txt (3 MB)
    • sorted_cut-10_kres_word_c-no_n-2_t-1.txt (14 MB)
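
The MD5 value listed for each archive can be used to check a download for corruption. The sketch below uses Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` and the local file path in the usage comment are assumptions for illustration, while the checksum is the one listed above for kres_ngrams_word_1-5.zip. The other archives can be checked the same way with their own listed MD5 values.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Checksum as listed in this record for kres_ngrams_word_1-5.zip.
EXPECTED_WORD_MD5 = "a7af448ffb1f4dc99f22a8534f587ce6"

# Usage, assuming the archive was saved in the current directory:
# matches_checksum("kres_ngrams_word_1-5.zip", EXPECTED_WORD_MD5)
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a protection against deliberate tampering.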

Name: kres_ngrams_lc_1-5.zip
Size: 12.05 MB
Format: application/zip
Description: 1- to 5-grams of words in lowercase excluding punctuation. The minimum frequency threshold is 10.
MD5: 6f4a20e78c01bee29840108d0268856c
File preview:
    • sorted_cut-10_kres_lc_c-no_n-3_t-1.txt (9 MB)
    • sorted_cut-10_kres_lc_c-no_n-1_t-1.txt (3 MB)
    • sorted_cut-10_kres_lc_c-no_n-2_t-1.txt (14 MB)
    • sorted_cut-10_kres_lc_c-no_n-5_t-1.txt (1 MB)
    • sorted_cut-10_kres_lc_c-no_n-4_t-1.txt (4 MB)

Name: kres_ngrams_word-lemma-tag_1-5.zip
Size: 30.92 MB
Format: application/zip
Description: 1- to 5-grams of words with lemma and morphosyntactic tag including punctuation. The minimum frequency threshold is 10.
MD5: 0d505febc737cc44210813910f5ac90b
File preview:
    • sorted_cut-10_kres_word-lemma-tag_c-yes_n-1_t-1.txt (9 MB)
    • sorted_cut-10_kres_word-lemma-tag_c-yes_n-5_t-1.txt (9 MB)
    • sorted_cut-10_kres_word-lemma-tag_c-yes_n-3_t-1.txt (32 MB)
    • sorted_cut-10_kres_word-lemma-tag_c-yes_n-4_t-1.txt (18 MB)
    • sorted_cut-10_kres_word-lemma-tag_c-yes_n-2_t-1.txt (38 MB)

Name: kres_AFL_lc_1-5_min5M.zip
Size: 7.57 MB
Format: application/zip
Description: Adjusted frequency list for 1- to 5-grams of words in lowercase excluding punctuation. The minimum relative frequency threshold for substring reduction is 5. Column 1: n-gram; column 2: length of n-gram; column 3: corpus frequency.
MD5: 9e3b90ec6feb43ea44df815e4e4b078e
File preview:
    • AFL_kres_lc_c-no_n-5_t-500.txt (19 MB)
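
The three-column layout described for the adjusted frequency list (n-gram, n-gram length, corpus frequency) could be read with a small parser like the one below. This is a sketch under assumptions the record does not confirm: columns are taken to be tab-separated, the frequency is taken to be an integer, and the function name `parse_afl` and the sample lines are invented for illustration.

```python
import io


def parse_afl(lines, sep="\t"):
    """Yield (ngram, length, frequency) tuples from AFL-style lines.

    Assumes three `sep`-separated columns per line (the separator is an
    assumption; the record only states the column order).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split from the right so the n-gram column is kept whole.
        ngram, length, frequency = line.rsplit(sep, 2)
        yield ngram, int(length), int(frequency)


# Invented sample lines, not taken from the actual file.
sample = io.StringIO("na primer\t2\t1234\nv\t1\t99\n")
rows = list(parse_afl(sample))

# Keep only multiword entries (length of 2 or more).
multiword = [row for row in rows if row[1] >= 2]
```

With a real file, the same generator could be fed an open file handle, for example `open(path, encoding="utf-8")`, where the UTF-8 encoding is likewise an assumption.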
