Frequency lists of word-level n-grams from the Gigafida 2.0 corpus

Name: Frequency lists of word-level n-grams from the Gigafida 2.0 corpus
License: https://creativecommons.org/licenses/by-sa/4.0/

Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon

dc.contributor.author	Čibej, Jaka
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Krek, Simon
dc.date.accessioned	2019-11-13T09:04:15Z
dc.date.available	2019-11-13T09:04:15Z
dc.date.issued	2019-11-18
dc.identifier.uri	http://hdl.handle.net/11356/1274
dc.description	Frequency lists of word-level n-grams (or word sets) were extracted from the Gigafida 2.0 Corpus of Written Standard Slovene (https://viri.cjvt.si/gigafida/) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams that occur in the corpus with a minimum relative frequency of 2 per million, along with their absolute and relative frequencies, percentages, distribution across the text types included in the corpus taxonomy, and six collocation measures: Dice, t-score, MI, MI3, logDice, and simple LL. The n-grams were extracted from lower-case word forms and morphosyntactic tags. For large lists, shortened versions with the first 150,000 lines were also prepared to facilitate further processing in spreadsheet analysis software.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://slovnica.ijs.si/
dc.subject	n-grams
dc.subject	standard language
dc.subject	lemmas
dc.subject	morphosyntactic tags
dc.subject	normalized word forms
dc.subject	word sets
dc.title	Frequency lists of word-level n-grams from the Gigafida 2.0 corpus
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
files.count	1
files.size	22327366
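
The collocation measures named in the description can be illustrated for the two-word case. The sketch below uses the commonly cited textbook definitions; the record does not state which exact formulas the LIST tool applies, nor how they are generalised to 3-, 4- and 5-grams, so the function names, argument names, and formulas here are assumptions for illustration only.

```python
import math


def relative_frequency_per_million(freq: int, corpus_size: int) -> float:
    """Relative frequency expressed per million tokens."""
    return freq / corpus_size * 1_000_000


def collocation_measures(f_xy: int, f_x: int, f_y: int, n: int) -> dict:
    """Textbook bigram association scores (assumed, not taken from LIST).

    f_xy: frequency of the bigram "x y"
    f_x, f_y: frequencies of the two words on their own
    n: corpus size in tokens
    """
    expected = f_x * f_y / n          # co-occurrences expected by chance
    dice = 2 * f_xy / (f_x + f_y)
    return {
        "dice": dice,
        "logdice": 14 + math.log2(dice),
        "mi": math.log2(f_xy / expected),
        "mi3": math.log2(f_xy ** 3 / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "simple_ll": 2 * (f_xy * math.log(f_xy / expected) - (f_xy - expected)),
    }
```

Under these definitions, an n-gram passes the threshold mentioned in the description when `relative_frequency_per_million(freq, corpus_size) >= 2`.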

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: GF2.0-word_sets.zip
Size: 21.29 MB
Format: application/zip
Description: Frequency lists of word-level n-grams from Gigafida 2.0 with minimum relative frequency of 2/million
MD5: 22e911e80ecfd2cde4458acd74d83b4b
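
The MD5 value listed for the archive can be used to check a download for corruption. A minimal sketch, assuming the archive has been saved under its listed name in the current directory; the function name `md5_of_file` and the chunk size are illustrative choices.

```python
import hashlib
import os

# Checksum and file name as listed in this record.
EXPECTED_MD5 = "22e911e80ecfd2cde4458acd74d83b4b"
ARCHIVE_PATH = "GF2.0-word_sets.zip"  # assumed local download location


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and os.path.exists(ARCHIVE_PATH):
    actual = md5_of_file(ARCHIVE_PATH)
    print("checksum OK" if actual == EXPECTED_MD5 else f"checksum mismatch: {actual}")
```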

File preview

- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-3grams-taxonomy-collocativity-entire.tsv (2 MB)
- GF2.0-word_sets-lowercase_forms-3grams-taxonomy-collocativity-entire.tsv (2 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-2grams-taxonomy-collocativity-entire.tsv (10 MB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-4grams_taksonomija-collocativity-entire.tsv (12 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-5grams-taxonomy-collocativity-entire.tsv (44 kB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-2grams-taxonomy-collocativity-entire.tsv (9 MB)
- GF2.0-word_sets-lowercase_forms-2grams-taxonomy-collocativity-entire.tsv (9 MB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-5grams-taxonomy-collocativity-entire.tsv (38 kB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-2grams_taksonomija-collocativity-entire.tsv (5 MB)
- GF2.0-word_sets-lowercase_forms-5grams-taxonomy-collocativity-entire.tsv (36 kB)
- GF2.0-word_sets-lowercase_forms-lemmas-4grams-taxonomy-collocativity-entire.tsv (439 kB)
- GF2.0-word_sets-lowercase_forms-morphosyntactic_tags-4grams-taxonomy-collocativity-entire.tsv (392 kB)
- GF2.0-word_sets-lowercase_forms-4grams-taxonomy-collocativity-entire.tsv (383 kB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-3grams_taksonomija-collocativity-entire.tsv (14 MB)
- GF2.0-word_sets-lowercase_forms-lemmas-3grams-taxonomy-collocativity-entire.tsv (3 MB)
- GF2.0-word_sets-morphosyntactic_tags-parts_of_speech-5grams_taksonomija-collocativity-entire.tsv (2 MB)
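
The lists can be inspected without unpacking the whole archive. The sketch below reads the header and the first few rows of one member file. It assumes the files are UTF-8, tab-separated, and start with a header row; the record does not document the column names or the directory layout inside the archive, so none are hard-coded, and `peek_tsv` is a hypothetical helper name.

```python
import csv
import io
import zipfile


def peek_tsv(zip_path: str, member: str, n_rows: int = 5):
    """Return (header, first n_rows rows) of a TSV file stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Assumed encoding: UTF-8. QUOTE_NONE keeps quote characters in tokens literal.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            header = next(reader, [])
            rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Since member paths inside the archive may include a directory prefix, `zipfile.ZipFile("GF2.0-word_sets.zip").namelist()` is a safe way to obtain the exact name to pass as `member`.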
