dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Krek, Simon |
dc.date.accessioned | 2019-11-13T09:04:15Z |
dc.date.available | 2019-11-13T09:04:15Z |
dc.date.issued | 2019-11-18 |
dc.identifier.uri | http://hdl.handle.net/11356/1274 |
dc.description | Frequency lists of word-level n-grams (or word sets) were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams occurring in the corpus with a minimum relative frequency of 2 per million, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms and morphosyntactic tags. For large lists, shortened versions containing the first 150,000 lines were also prepared to facilitate further processing in spreadsheet software. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://slovnica.ijs.si/ |
dc.subject | n-grams |
dc.subject | standard language |
dc.subject | lemmas |
dc.subject | morphosyntactic tags |
dc.subject | normalized word forms |
dc.subject | word sets |
dc.title | Frequency lists of word-level n-grams from the Gigafida 2.0 corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
files.count | 1 |
files.size | 22327366 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)




- Name: GF2.0-word_sets.zip
- Size: 21.29 MB
- Format: application/zip
- Description: Frequency lists of word-level n-grams from Gigafida 2.0 with a minimum relative frequency of 2 per million
- MD5: 22e911e80ecfd2cde4458acd74d83b4b
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (2 MB)
- GF2.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (2 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-2grams-taxonomy-collocativity-entire.tsv (10 MB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-4grams_taksonomija-collocativity-entire.tsv (12 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-5grams-taxonomy-collocativity-entire.tsv (44 kB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (9 MB)
- GF2.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (9 MB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (38 kB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-2grams_taksonomija-collocativity-entire.tsv (5 MB)
- GF2.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (36 kB)
- GF2.0-word_sets-lowercase_forms-lemmas-4grams-taxonomy-collocativity-entire.tsv (439 kB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (392 kB)
- GF2.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (383 kB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-3grams_taksonomija-collocativity-entire.tsv (14 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-3grams-taxonomy-collocativity-entire.tsv (3 MB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-5grams_taksonomija-collocativity-entire.tsv (2 MB)
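
The description above names a relative-frequency threshold (2 per million) and six collocation measures. As a rough illustration only, the sketch below computes them for a single 2-gram from raw counts, using commonly cited textbook definitions. It is an assumption that these match the formulas implemented in the LIST tool, and the record does not say how the measures are generalized to 3-, 4- and 5-grams. The function and parameter names (`relative_frequency_per_million`, `bigram_measures`, `f_xy`, `f_x`, `f_y`, `n`) are hypothetical and do not correspond to column names in the TSV files.

```python
import math


def relative_frequency_per_million(absolute_freq, corpus_size):
    """Occurrences per million tokens."""
    return absolute_freq * 1_000_000 / corpus_size


def bigram_measures(f_xy, f_x, f_y, n):
    """Collocation measures for one 2-gram (textbook definitions, assumed).

    f_xy: frequency of the 2-gram, f_x / f_y: frequencies of its two words,
    n: corpus size in tokens. All counts must be positive.
    """
    # Co-occurrence frequency expected if the two words were independent.
    expected = f_x * f_y / n
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "log_dice": 14 + math.log2(dice),
        # Simple log-likelihood: 2 * (O * ln(O / E) - (O - E)).
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }


# Made-up counts for illustration; they are not taken from Gigafida 2.0.
measures = bigram_measures(f_xy=50, f_x=100, f_y=100, n=1_000_000)
```

Under these definitions, a 2-gram that occurs exactly as often as independence predicts gets MI, t-score, and simple LL of 0, which gives a quick sanity check when comparing against the values in the lists.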