dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2018-08-01T17:32:32Z
dc.date.available 2018-08-01T17:32:32Z
dc.date.issued 2018-08-01
dc.identifier.uri http://hdl.handle.net/11356/1192
dc.description A collection of n-grams extracted from the Janes corpus of Slovenian user-generated content, version 1.0 (cf. http://nl.ijs.si/janes/). Three sets of n-gram lists are provided for lowercased word n-grams of length 1 to 5:
- extensive frequency lists of all extracted n-grams
- filtered frequency lists of n-grams with a minimum frequency of 10 per million
- adjusted frequency list of all n-grams with a minimum frequency of 10 per million
Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted lists, only n-grams occurring in at least 2 different texts have been extracted.
Key references:
- K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018.
- T. Erjavec, N. Ljubešić, D. Fišer, 2018. Korpus slovenskih spletnih uporabniških vsebin Janes. In: D. Fišer (ed.), Viri, orodja in metode za analizo spletne slovenščine. Znanstvena založba Filozofske fakultete Univerze v Ljubljani. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/111
- M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, 135–169.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Dobrovoljc-K_Frekvencni-seznami-n-gramov-v-korpusih-slovenskega-jezika.pdf
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.subject n-grams
dc.subject wordlist
dc.subject multiword expressions
dc.title Janes corpus n-grams 1.0
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Kaja Dobrovoljc kaja.dobrovoljc@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) MR-36491 Young Researcher Programme nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
size.info 2502460 unigrams
size.info 35969381 bigrams
size.info 89128455 trigrams
size.info 113108440 4-grams
size.info 110320967 5-grams
size.info 351029703 n-grams
files.count 3
files.size 4029924426
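
The filtering criteria given in dc.description (lowercased word n-grams, counted only within sentences, punctuation ignored, a minimum relative frequency, and a minimum number of distinct texts) can be illustrated with a minimal sketch. This is not the pipeline used to build the resource: the regex tokenizer, the function names `count_ngrams` and `filter_ngrams`, and the choice of the total token count as the per-million base are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "word" is a run of Unicode word characters; the real
# corpus tokenization is not described in this record.
TOKEN_RE = re.compile(r"\w+")


def count_ngrams(texts, n):
    """Count lowercased word n-grams within sentences.

    `texts` is an iterable of texts, each given as a list of sentence
    strings. Returns (frequencies, text ids per n-gram, total tokens).
    """
    freq = Counter()
    text_ids = defaultdict(set)
    total_tokens = 0
    for text_id, sentences in enumerate(texts):
        for sentence in sentences:
            # Lowercase and drop punctuation by keeping only word tokens.
            tokens = TOKEN_RE.findall(sentence.lower())
            total_tokens += len(tokens)
            # The window never crosses a sentence boundary.
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                text_ids[gram].add(text_id)
    return freq, text_ids, total_tokens


def filter_ngrams(freq, text_ids, total_tokens,
                  min_per_million=10.0, min_texts=2):
    """Keep n-grams that meet both thresholds (hypothetical helper)."""
    kept = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if per_million >= min_per_million and len(text_ids[gram]) >= min_texts:
            kept[gram] = count
    return kept


if __name__ == "__main__":
    toy_texts = [["To je test.", "To je to!"], ["To je res."]]
    freq, text_ids, total = count_ngrams(toy_texts, n=2)
    print(filter_ngrams(freq, text_ids, total))
```

On the toy input, only the bigram that occurs in both texts survives the `min_texts=2` filter; the sketch does not attempt the adjusted frequency list computation.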


Files in this item

Name: filtered_1-5-grams_Janes.zip
Size: 150.06 KB
Format: application/zip
Description: List of n-grams with frequency above 10/mil.
MD5: 9355cef9e75b71c0a1820d4d8f12fc25
File preview:
    • janes_lc_c-no_n-1_t-1913_x-2.txt (160 kB)
    • janes_lc_c-no_n-5_t-1913_x-2.txt (6 kB)
    • janes_lc_c-no_n-4_t-1913_x-2.txt (8 kB)
    • janes_lc_c-no_n-3_t-1913_x-2.txt (30 kB)
    • janes_lc_c-no_n-2_t-1913_x-2.txt (120 kB)

Name: adjusted_1-5-grams_Janes.zip
Size: 111.83 KB
Format: application/zip
Description: List of n-grams with adjusted frequency above 10/mil.
MD5: 0a4c930ec2a20804ca92a6f9838fbe92
File preview:
    • AFL_janes_lc_c-no_n-5_t-1913_x-2.txt (330 kB)

Name: all_1-5-grams_Janes.zip
Size: 3.75 GB
Format: application/zip
Description: Collection of all n-grams
MD5: 07f3b77b3d96d6abd2f8cc017ad0ab0e
File preview:
  • all_1-5-grams_Janes
    • unsorted_janes_lc_c-no_n-4_t-1_x-1.txt (2 GB)
    • unsorted_janes_lc_c-no_n-1_t-1_x-1.txt (37 MB)
    • unsorted_janes_lc_c-no_n-3_t-1_x-1.txt (1 GB)
    • unsorted_janes_lc_c-no_n-5_t-1_x-1.txt (3 GB)
    • unsorted_janes_lc_c-no_n-2_t-1_x-1.txt (655 MB)
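
Since each archive is listed with an MD5 checksum, a download can be checked against it before unpacking. The following is a minimal sketch using Python's standard `hashlib`; the file names and checksums are copied from the listing above, while the helper names `md5_of_file` and `verify_download` and the assumption that the archive keeps its listed file name locally are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "filtered_1-5-grams_Janes.zip": "9355cef9e75b71c0a1820d4d8f12fc25",
    "adjusted_1-5-grams_Janes.zip": "0a4c930ec2a20804ca92a6f9838fbe92",
    "all_1-5-grams_Janes.zip": "07f3b77b3d96d6abd2f8cc017ad0ab0e",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so that the
    3.75 GB archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Compare a local file against the checksum listed for its name
    (hypothetical helper; assumes the file was saved under that name)."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of_file(path) == expected
```

A call such as `verify_download("filtered_1-5-grams_Janes.zip")` returns `True` only if the local file's digest matches the listed value.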
