
 
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Krek, Simon
dc.date.accessioned 2019-11-13T08:53:47Z
dc.date.available 2019-11-13T08:53:47Z
dc.date.issued 2019-11-18
dc.identifier.uri http://hdl.handle.net/11356/1271
dc.description Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus along with their absolute and relative frequencies, percentages, distribution across the text-types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, normalized word forms, and morphosyntactic tags. For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher Jožef Stefan Institute
dc.relation.isreplacedby http://hdl.handle.net/11356/1365
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://slovnica.ijs.si/
dc.subject n-grams
dc.subject words
dc.subject word forms
dc.subject normalized forms
dc.subject spoken corpus
dc.subject word sets
dc.subject morphosyntactic tags
dc.title Frequency lists of word-level n-grams from the GOS 1.0 corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
size.info 24 files
files.count 3
files.size 301486297
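
The description above names the collocation measures Dice, t-score, MI, MI3, logDice, and simple LL but does not give their formulas. The following is a minimal sketch of the commonly used bigram definitions of these measures; it is an assumption that the LIST tool follows the same definitions, and how they are generalized to 3-, 4- and 5-grams is not stated in this record. The function name and the example frequencies are hypothetical.

```python
import math


def collocation_scores(f_xy, f_x, f_y, n):
    """Common association scores for a bigram (x, y).

    f_xy: frequency of the bigram, f_x / f_y: frequencies of its two words,
    n: corpus size in tokens. These are textbook definitions, not necessarily
    the exact ones implemented in the LIST tool.
    """
    expected = f_x * f_y / n           # co-occurrences expected by chance
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "logdice": 14 + math.log2(dice),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


# Hypothetical frequencies, chosen only to make the arithmetic easy to follow.
scores = collocation_scores(f_xy=8, f_x=16, f_y=16, n=1024)
```

With these illustrative numbers the expected frequency is 0.25, so MI is log2(32) = 5 and logDice is 14 + log2(0.5) = 13.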


 Files in this item

 Download all files in item (287.52 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name
GOS1.0-word_sets-lowercase_forms.zip
Size
110.06 MB
Format
application/zip
Description
Frequency lists of word-level n-grams from lower-case word forms in GOS 1.0
MD5
a2622c774032e053699306730e12f6e0
 File Preview  
    • GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (72 MB)
    • GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (118 MB)
    • GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (123 MB)
    • GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (114 MB)
    • GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
    • GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)
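
Because the record does not document the column layout of the .tsv files, it can help to look at the first few rows before loading a whole list. The sketch below reads the first rows of one member directly from a downloaded archive without unpacking it. The helper name `peek_tsv` is hypothetical, and it is an assumption that the files are UTF-8 encoded and that the member name inside the archive matches the name shown in the listing (it may carry a directory prefix).

```python
import csv
import io
import itertools
import zipfile


def peek_tsv(zip_path, member, n_rows=5):
    """Return the first n_rows rows of a tab-separated member of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Assumption: UTF-8 text. QUOTE_NONE keeps quote characters
            # that occur in transcribed word forms as literal text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(itertools.islice(reader, n_rows))
```

A call such as `peek_tsv("GOS1.0-word_sets-lowercase_forms.zip", "GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-short.tsv")` would then return the first five rows as lists of strings, provided the member path assumption holds; `zipfile.ZipFile(...).namelist()` shows the actual member names.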
Name
GOS1.0-word_sets-morphosyntactic_tags.zip
Size
69.69 MB
Format
application/zip
Description
Frequency lists of word-level n-grams from morphosyntactic tags in GOS 1.0
MD5
00c664d9d554d8d816b959a3e8797928
 File Preview  
    • GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-short.tsv (27 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (8 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (49 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-short.tsv (8 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (101 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-short.tsv (28 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (122 MB)
    • GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-short.tsv (27 MB)
Name
GOS1.0-word_sets-normalized_forms.zip
Size
107.77 MB
Format
application/zip
Description
Frequency lists of word-level n-grams from normalized word forms in GOS 1.0
MD5
8df609df4a850d57f7f1f038d317fc1d
 File Preview  
    • GOS1.0-word_sets-normalized_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
    • GOS1.0-word_sets-normalized_forms-2grams-taxonomy-collocativity-entire.tsv (64 MB)
    • GOS1.0-word_sets-normalized_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
    • GOS1.0-word_sets-normalized_forms-3grams-taxonomy-collocativity-entire.tsv (113 MB)
    • GOS1.0-word_sets-normalized_forms-4grams-taxonomy-collocativity-entire.tsv (122 MB)
    • GOS1.0-word_sets-normalized_forms-5grams-taxonomy-collocativity-short.tsv (26 MB)
    • GOS1.0-word_sets-normalized_forms-5grams-taxonomy-collocativity-entire.tsv (115 MB)
    • GOS1.0-word_sets-normalized_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)
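
Each archive above is listed with an MD5 checksum, which can be used to confirm that a download is complete and unaltered. The following sketch, using only the Python standard library, compares local files against the checksums given in this record. The function names and the assumption that the archives sit in a single directory under their listed names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "GOS1.0-word_sets-lowercase_forms.zip": "a2622c774032e053699306730e12f6e0",
    "GOS1.0-word_sets-morphosyntactic_tags.zip": "00c664d9d554d8d816b959a3e8797928",
    "GOS1.0-word_sets-normalized_forms.zip": "8df609df4a850d57f7f1f038d317fc1d",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True if it exists and its checksum matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify_downloads("path/to/downloads")` returns a dictionary with one boolean per archive; a `False` value means the file is missing or its checksum differs from the one listed here.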
