
 
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2016-02-16T09:39:41Z
dc.date.available 2016-02-16T09:39:41Z
dc.date.issued 2015-07-01
dc.identifier.uri http://hdl.handle.net/11356/1053
dc.description This is a collection of n-grams extracted from the IMP corpus of historical Slovene (http://hdl.handle.net/11356/1031). In addition to the separate lists of n-grams for tokens and their attributes (modernised form, morphosyntactic tag, lemma), it includes an adjusted frequency list produced with statistical substring reduction (as described in O'Donnell 2011). Only n-grams within sentences have been counted.
dc.language.iso slv
dc.publisher Trojina, Institute for Applied Slovene Studies
dc.publisher Faculty of Arts, University of Ljubljana
dc.relation.isreplacedby http://hdl.handle.net/11356/1194
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://nl.ijs.si/imp/index-en.html
dc.subject n-grams
dc.subject wordlist
dc.subject multiword expressions
dc.subject historical language
dc.title IMP corpus n-grams 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@gmail.com Trojina, Institute for Applied Slovene Studies
sponsor ARRS (Slovenian Research Agency) MR-36491 Young Researcher Programme nationalFunds
size.info 61 mb
size.info 2464719 items
files.count 4
files.size 17468333
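
The description states that only n-grams within sentences were counted, and the file descriptions give a minimum frequency threshold of 5. The sketch below illustrates that kind of counting on toy data. It is not the extraction code used for this resource: the function names, the tokenised input format, and the reading of the threshold as "frequency of at least 5" are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence


def count_ngrams(sentences: Iterable[Sequence[str]], max_n: int = 5) -> Counter:
    """Count 1- to max_n-grams without crossing a sentence boundary."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # range() is empty when the sentence is shorter than n tokens
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def apply_threshold(counts: Counter, min_freq: int = 5) -> dict:
    """Keep n-grams seen at least min_freq times (inclusive cut-off is an assumption)."""
    return {ngram: freq for ngram, freq in counts.items() if freq >= min_freq}


# Toy input, not taken from the IMP corpus: each sentence is a list of tokens.
toy_sentences = [["a", "b", "c"], ["a", "b"]] * 5
toy_counts = apply_threshold(count_ngrams(toy_sentences))
```

Because each sentence is processed separately, a sequence such as ("c", "a") that would span two adjacent sentences is never counted.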


 Files in this item

 Download all files in item (16.66 MB)
This item is
Publicly Available
and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name imp_ngrams_word_1-5.zip
Size 3.01 MB
Format application/zip
Description 1- to 5-grams of words (historical spelling) excluding punctuation. The minimum frequency threshold is 5.
MD5 9f5ef25bde132c88d881ac3f9302d554
File Preview
    • sorted_cut-5_imp_word_c-no_n-5_t-1.txt (90 kB)
    • sorted_cut-5_imp_word_c-no_n-4_t-1.txt (555 kB)
    • sorted_cut-5_imp_word_c-no_n-2_t-1.txt (3 MB)
    • sorted_cut-5_imp_word_c-no_n-3_t-1.txt (2 MB)
    • sorted_cut-5_imp_word_c-no_n-1_t-1.txt (1 MB)
Name imp_ngrams_norm_1-5.zip
Size 3 MB
Format application/zip
Description 1- to 5-grams of normalized words (modernised spelling) excluding punctuation. The minimum frequency threshold is 5.
MD5 7e673805f64b452c63071fe25a7d4635
File Preview
    • sorted_cut-5_imp_lc_c-no_n-5_t-1.txt (121 kB)
    • sorted_cut-5_imp_lc_c-no_n-3_t-1.txt (2 MB)
    • sorted_cut-5_imp_lc_c-no_n-4_t-1.txt (698 kB)
    • sorted_cut-5_imp_lc_c-no_n-2_t-1.txt (3 MB)
    • sorted_cut-5_imp_lc_c-no_n-1_t-1.txt (1 MB)
Name imp_ngrams_word-norm-lemma-tag_1-5.zip
Size 8.61 MB
Format application/zip
Description 1- to 5-grams of words with normalized form, lemma and morphosyntactic tag including punctuation. The minimum frequency threshold is 5.
MD5 e0f78e9be4bc742c92c0d17e2d556f48
File Preview
    • sorted_cut-5_imp_word-lc-lemma-tag_c-yes_n-4_t-1.txt (5 MB)
    • sorted_cut-5_imp_word-lc-lemma-tag_c-yes_n-5_t-1.txt (1 MB)
    • sorted_cut-5_imp_word-lc-lemma-tag_c-yes_n-3_t-1.txt (11 MB)
    • sorted_cut-5_imp_word-lc-lemma-tag_c-yes_n-2_t-1.txt (15 MB)
    • sorted_cut-5_imp_word-lc-lemma-tag_c-yes_n-1_t-1.txt (4 MB)
Name imp_AFL_norm_1-5_min5M.zip
Size 2.04 MB
Format application/zip
Description Adjusted frequency list for 1- to 5-grams of normalized words (modernised spelling) excluding punctuation. The minimum relative frequency threshold for substring reduction is 5. Column 1: n-gram; column 2: length of n-gram; column 3: adjusted corpus frequency.
MD5 79c94b6f4f462c7797e52724165087b7
File Preview
    • sorted_cut-1_AFL_imp_lc_c-no_n-5_t-75.txt (5 MB)
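
Each archive above is listed with an MD5 checksum, and the adjusted frequency list is described as having three columns (n-gram, length of n-gram, adjusted corpus frequency). The following is a minimal sketch of how a download could be checked against its listed checksum and how the three columns could be read. The record does not state the column delimiter or the text encoding, so tab separation and UTF-8 are assumptions, and the helper names md5_of, read_afl and AflEntry are hypothetical. The frequency is parsed as a float in case adjusted values are not integers.

```python
import csv
import hashlib
from pathlib import Path
from typing import Iterator, NamedTuple


class AflEntry(NamedTuple):
    ngram: str            # column 1: the n-gram
    length: int           # column 2: length of the n-gram
    adjusted_freq: float  # column 3: adjusted corpus frequency


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_afl(path: Path, delimiter: str = "\t",
             encoding: str = "utf-8") -> Iterator[AflEntry]:
    """Yield one AflEntry per line; delimiter and encoding are assumed, not documented."""
    with path.open(encoding=encoding, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 3:
                continue  # skip blank or incomplete lines
            yield AflEntry(row[0], int(row[1]), float(row[2]))
```

For example, md5_of(Path("imp_AFL_norm_1-5_min5M.zip")) would be compared with the MD5 value listed for that archive before unpacking it, and read_afl would then be pointed at the extracted .txt file. If the file turns out to use a different separator, pass it through the delimiter argument.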
