dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2020-11-02T12:39:52Z |
dc.date.available | 2020-11-02T12:39:52Z |
dc.date.issued | 2020-10-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1365 |
dc.description | Frequency lists of word-level n-grams (or word sets) were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus along with their absolute and relative frequencies, percentages, distribution across the text-types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms, standardized word forms, and morphosyntactic tags. For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software. Compared to the previous version (http://hdl.handle.net/11356/1271), this one fixes several typos and replaces all instances of "normalized forms" with the more appropriate term "standardized forms" (as used in the SSJ project). |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Jožef Stefan Institute |
dc.relation.replaces | http://hdl.handle.net/11356/1271 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://slovnica.ijs.si/ |
dc.subject | n-grams |
dc.subject | words |
dc.subject | word forms |
dc.subject | spoken corpus |
dc.subject | word sets |
dc.subject | morphosyntactic tags |
dc.subject | standardized forms |
dc.title | Frequency lists of word-level n-grams from the GOS 1.0 corpus 1.1 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
sponsor | Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other |
size.info | 23 files |
files.count | 3 |
files.size | 301486577 |
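The description names the collocation measures Dice, t-score, MI, MI3, logDice, and simple LL, but this record does not give their formulas. The following is a minimal sketch of the commonly used textbook definitions for a two-word n-gram, assuming `f_xy` is the bigram frequency, `f_x` and `f_y` are the frequencies of the two words, and `n` is the corpus size in tokens. The function name and parameters are hypothetical. The LIST tool may define these measures differently, particularly for 3-, 4- and 5-grams, so values computed this way need not match the published lists.

```python
import math


def collocation_measures(f_xy, f_x, f_y, n):
    """Textbook bigram association scores (an assumption, not LIST's code).

    f_xy: observed frequency of the bigram "x y"
    f_x, f_y: frequencies of the individual words
    n: corpus size in tokens
    """
    # Frequency expected if x and y were statistically independent.
    expected = f_x * f_y / n
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "log_dice": 14 + math.log2(dice),
        # Simple log-likelihood: 2 * (O * ln(O / E) - (O - E)).
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in collocation_measures(10, 20, 20, 1000).items():
        print(f"{name}: {value:.4f}")
```

All six scores are derived from the same four counts, which is why they can be listed side by side for every n-gram.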
Files in this item
Download all files in this item (287.52 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)




- Name
- GOS1.0-word_sets-lowercase_forms.zip
- Size
- 110.06 MB
- Format
- application/zip
- Description
- Frequency of word-level n-grams from lower-case word forms in GOS 1.0
- MD5
- 2cea7e5603afc581ed9466e5eea92d1b
- GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (72 MB)
- GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (123 MB)
- GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (118 MB)
- GOS1.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
- GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (114 MB)
- GOS1.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-short.tsv (25 MB)
- GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
- GOS1.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)
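Because the unpacked lists are large, it can help to inspect the first few rows of a list without extracting the whole archive. The sketch below streams one TSV member directly from a ZIP file using only the Python standard library. The function name is hypothetical, and the UTF-8 encoding, the absence of quoting, and the member path inside the archive are assumptions. This record does not document the column layout, so the rows are returned as plain lists of strings.

```python
import csv
import io
import itertools
import zipfile


def peek_tsv_in_zip(zip_path, member, n_rows=5, encoding="utf-8"):
    """Return the first n_rows rows of a TSV member inside a ZIP archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.namelist() shows the actual member paths if `member` is wrong.
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            # QUOTE_NONE keeps quote characters in word forms as literal text.
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            return list(itertools.islice(reader, n_rows))


if __name__ == "__main__":
    rows = peek_tsv_in_zip(
        "GOS1.0-word_sets-lowercase_forms.zip",
        "GOS1.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-short.tsv",
    )
    for row in rows:
        print(row)
```

Reading only the first rows keeps memory use small even for the "entire" lists, which exceed 100 MB when unpacked.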

- Name
- GOS1.0-word_sets-morphosyntactic_tags.zip
- Size
- 69.69 MB
- Format
- application/zip
- Description
- Frequency lists of word-level n-grams from morphosyntactic tags in GOS 1.0
- MD5
- 13c8d2de23f4ce266294316d0576d721
- GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-short.tsv (27 MB)
- GOS1.0-word_sets-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (49 MB)
- GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (8 MB)
- GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-short.tsv (28 MB)
- GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (101 MB)
- GOS1.0-word_sets-morphosyntactic_tags-2grams-taxonomy-collocativity-short.tsv (8 MB)
- GOS1.0-word_sets-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (122 MB)
- GOS1.0-word_sets-morphosyntactic_tags-4grams-taxonomy-collocativity-short.tsv (27 MB)

- Name
- GOS1.0-word_sets-standardized_forms.zip
- Size
- 107.77 MB
- Format
- application/zip
- Description
- Frequency lists of word-level n-grams from standardized word forms in GOS 1.0
- MD5
- 29ff8ff8cf9e1ff778fe77bd5f0a4e62
- GOS1.0-word_sets-standardized_forms-4grams-taxonomy-collocativity-entire.tsv (122 MB)
- GOS1.0-word_sets-standardized_forms-4grams-taxonomy-collocativity-short.tsv (25 MB)
- GOS1.0-word_sets-standardized_forms-5grams-taxonomy-collocativity-entire.tsv (115 MB)
- GOS1.0-word_sets-standardized_forms-3grams-taxonomy-collocativity-short.tsv (25 MB)
- GOS1.0-word_sets-standardized_forms-5grams-taxonomy-collocativity-short.tsv (26 MB)
- GOS1.0-word_sets-standardized_forms-3grams-taxonomy-collocativity-entire.tsv (113 MB)
- GOS1.0-word_sets-standardized_forms-2grams-taxonomy-collocativity-short.tsv (24 MB)
- GOS1.0-word_sets-standardized_forms-2grams-taxonomy-collocativity-entire.tsv (64 MB)
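
Each archive above is listed with an MD5 checksum, which can be used to check that a download is complete and uncorrupted. The following is a minimal sketch using Python's standard `hashlib` module. The helper names are hypothetical, the checksums are copied from the listing above, and the archives are assumed to sit in the current working directory.

```python
import hashlib
import os

# Checksums as published in the file listing of this record.
EXPECTED_MD5 = {
    "GOS1.0-word_sets-lowercase_forms.zip": "2cea7e5603afc581ed9466e5eea92d1b",
    "GOS1.0-word_sets-morphosyntactic_tags.zip": "13c8d2de23f4ce266294316d0576d721",
    "GOS1.0-word_sets-standardized_forms.zip": "29ff8ff8cf9e1ff778fe77bd5f0a4e62",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if not os.path.exists(name):
            print(f"{name}: not found")
        elif matches_checksum(name, expected):
            print(f"{name}: OK")
        else:
            print(f"{name}: checksum mismatch")
```

Reading in chunks avoids loading a 100 MB archive into memory at once. MD5 is adequate for detecting transfer errors but is not a safeguard against deliberate tampering.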