dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Krek, Simon
dc.date.accessioned 2020-11-02T12:39:52Z
dc.date.available 2020-11-02T12:39:52Z
dc.date.issued 2020-10-28
dc.identifier.uri http://hdl.handle.net/11356/1365
dc.description Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus along with their absolute and relative frequencies, percentages, distribution across the text types in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, standardized word forms, and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software. Compared to the previous version (http://hdl.handle.net/11356/1271), this version fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project).
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.relation.replaces http://hdl.handle.net/11356/1271
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://slovnica.ijs.si/
dc.subject n-grams
dc.subject words
dc.subject word forms
dc.subject spoken corpus
dc.subject word sets
dc.subject morphosyntactic tags
dc.subject standardized forms
dc.title Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
size.info 23 files
files.count 3
files.size 301486577
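
The collocation measures named in dc.description can be illustrated for the two-word case with a minimal sketch. It uses the commonly cited textbook definitions; the record does not state which exact formulas the LIST tool applies, nor how they are generalized to 3-, 4- and 5-grams, so the function name, its parameters, and the formulas themselves are assumptions made for illustration only.

```python
import math


def collocation_measures(f_xy, f_x, f_y, n):
    """Association scores for a bigram, computed from raw counts.

    f_xy: frequency of the bigram; f_x, f_y: frequencies of its two words;
    n: corpus size in tokens. All names are hypothetical, not LIST output columns.
    """
    expected = f_x * f_y / n          # co-occurrences expected under independence
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "logdice": 14 + math.log2(dice),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        # "simple LL" assumed here to be the simple log-likelihood:
        # 2 * (O * ln(O / E) - (O - E))
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


# Made-up counts, chosen only so the arithmetic is easy to follow
scores = collocation_measures(f_xy=8, f_x=16, f_y=16, n=1024)
for name, value in scores.items():
    print(f"{name}\t{value:.4f}")
```

Scores computed this way may differ from the values in the published lists if the tool uses other variants of these measures.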


 Files in this item

 Download all files in item (287.52 MB)
This item is publicly available and licensed under Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
Name GOS1.0-word_sets-lowercase_forms.zip
Size 110.06 MB
Format application/zip
Description Frequency of word-level n-grams from lower-case word forms in GOS 1.0
MD5 2cea7e5603afc581ed9466e5eea92d1b
File preview
    • GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (72 MB)
    • GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (123 MB)
    • GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (118 MB)
    • GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (114 MB)
    • GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
    • GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)

Name GOS1.0-word_sets-morphosyntactic_tags.zip
Size 69.69 MB
Format application/zip
Description Frequency lists of word-level n-grams from morphosyntactic tags in GOS 1.0
MD5 13c8d2de23f4ce266294316d0576d721
File preview
    • GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-short.tsv (27 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (49 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (8 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-short.tsv (28 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (101 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-short.tsv (8 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (122 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-short.tsv (27 MB)

Name GOS1.0-word_sets-standardized_forms.zip
Size 107.77 MB
Format application/zip
Description Frequency lists of word-level n-grams from standardized word forms in GOS 1.0
MD5 29ff8ff8cf9e1ff778fe77bd5f0a4e62
File preview
    • GOS1.0-word_sets-standardized_forms-4grams-taxonomy-collocativity-entire.tsv (122 MB)
    • GOS1.0-word_sets-standardized_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-standardized_forms-5grams-taxonomy-collocativity-entire.tsv (115 MB)
    • GOS1.0-word_sets-standardized_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-standardized_forms-5grams-taxonomy-collocativity-short.tsv (26 MB)
    • GOS1.0-word_sets-standardized_forms-3grams-taxonomy-collocativity-entire.tsv (113 MB)
    • GOS1.0-word_sets-standardized_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
    • GOS1.0-word_sets-standardized_forms-2grams-taxonomy-collocativity-entire.tsv (64 MB)
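
The MD5 values listed for the three archives can be used to check that a download is intact. The following is a minimal sketch, assuming the archives were saved under their original names in a single directory; the helper names md5_of_file and verify_downloads are hypothetical and not part of the repository or the LIST tool.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing in this record
EXPECTED_MD5 = {
    "GOS1.0-word_sets-lowercase_forms.zip": "2cea7e5603afc581ed9466e5eea92d1b",
    "GOS1.0-word_sets-morphosyntactic_tags.zip": "13c8d2de23f4ce266294316d0576d721",
    "GOS1.0-word_sets-standardized_forms.zip": "29ff8ff8cf9e1ff778fe77bd5f0a4e62",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low for archives of this size (roughly 70 to 110 MB each).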
