
 
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Krek, Simon
dc.date.accessioned 2019-11-13T09:04:15Z
dc.date.available 2019-11-13T09:04:15Z
dc.date.issued 2019-11-18
dc.identifier.uri http://hdl.handle.net/11356/1274
dc.description Frequency lists of word-level n-grams (or word sets) were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams that occur in the corpus with a minimum relative frequency of 2 per million, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://slovnica.ijs.si/
dc.subject n-grams
dc.subject standard language
dc.subject lemmas
dc.subject morphosyntactic tags
dc.subject normalized word forms
dc.subject word sets
dc.title Frequency lists of word-level n-grams from the Gigafida 2.0 corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
files.count 1
files.size 22327366
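The description above names relative frequency per million and several collocation measures without defining them. The following is a minimal sketch of how such values are commonly computed for a bigram, assuming the standard textbook definitions; the record does not state which formulas the LIST tool uses (in particular for 3- to 5-grams), so the function names and formulas here are illustrative assumptions, not the tool's implementation.

```python
import math


def relative_frequency_per_million(freq: int, corpus_size: int) -> float:
    """Occurrences per million tokens."""
    return freq / corpus_size * 1_000_000


def dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Dice coefficient of a bigram (x, y) from joint and single-word counts."""
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice: 14 + log2(Dice). The constant 14 is the usual convention."""
    return 14 + math.log2(dice(f_xy, f_x, f_y))


def mi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Mutual information score: log2 of observed over expected frequency."""
    return math.log2(f_xy * n / (f_x * f_y))


def mi3(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI3: like MI, but with the joint frequency cubed."""
    return math.log2(f_xy ** 3 * n / (f_x * f_y))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)
```

With these definitions, an n-gram occurring 2 times in a hypothetical corpus of 1,000,000 tokens sits exactly at the 2-per-million threshold mentioned in the description.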


 Files in this item

Name
GF2.0-word_sets.zip
Size
21.29 MB
Format
application/zip
Description
Frequency lists of word-level n-grams from Gigafida 2.0 with minimum relative frequency of 2/million
MD5
22e911e80ecfd2cde4458acd74d83b4b
 File preview
    • GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (2 MB)
    • GF2.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (2 MB)
    • GF2.0-word_sets-lowercase_forms-lemmas-2grams-taxonomy-collocativity-entire.tsv (10 MB)
    • GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-4grams_taksonomija-collocativity-entire.tsv (12 MB)
    • GF2.0-word_sets-lowercase_forms-lemmas-5grams-taxonomy-collocativity-entire.tsv (44 kB)
    • GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (9 MB)
    • GF2.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (9 MB)
    • GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (38 kB)
    • GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-2grams_taksonomija-collocativity-entire.tsv (5 MB)
    • GF2.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (36 kB)
    • GF2.0-word_sets-lowercase_forms-lemmas-4grams-taxonomy-collocativity-entire.tsv (439 kB)
    • GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (392 kB)
    • GF2.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (383 kB)
    • GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-3grams_taksonomija-collocativity-entire.tsv (14 MB)
    • GF2.0-word_sets-lowercase_forms-lemmas-3grams-taxonomy-collocativity-entire.tsv (3 MB)
    • GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-5grams_taksonomija-collocativity-entire.tsv (2 MB)
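
The record lists an MD5 checksum for GF2.0-word_sets.zip. A minimal sketch of how a downloaded copy could be checked against it is given below; the helper name `file_md5` and the local path in the usage comment are assumptions for illustration, and only Python's standard `hashlib` module is used.

```python
import hashlib

# Checksum as listed in this record for GF2.0-word_sets.zip.
EXPECTED_MD5 = "22e911e80ecfd2cde4458acd74d83b4b"


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (hypothetical local path):
#   if file_md5("GF2.0-word_sets.zip") != EXPECTED_MD5:
#       raise ValueError("Checksum mismatch: the download may be corrupted")
```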
