Frequency lists of words from the GOS 1.0 corpus

Name: Frequency lists of words from the GOS 1.0 corpus
License: https://creativecommons.org/licenses/by-sa/4.0/

Čibej, Jaka; Arhar Holdt, Špela; Dobrovoljc, Kaja; Krek, Simon

dc.contributor.author	Čibej, Jaka
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Krek, Simon
dc.date.accessioned	2019-11-13T08:50:19Z
dc.date.available	2019-11-13T08:50:19Z
dc.date.issued	2019-11-18
dc.identifier.uri	http://hdl.handle.net/11356/1269
dc.description	Frequency lists of words were extracted from the GOS 1.0 Corpus of Spoken Slovene (http://hdl.handle.net/11356/1040) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all words occurring in the corpus along with their absolute and relative frequencies, percentages, and distribution across the text-types included in the corpus taxonomy. The lists were extracted for each part-of-speech category. For each part-of-speech, two lists were extracted: 1) one containing lemmas and their text-type distribution, 2) one containing lower-case word forms as well as their normalized forms, lemmas, and morphosyntactic tags along with their text-type distribution. In addition, four lists were extracted from all words (regardless of their part-of-speech category): 1) a list of all lemmas along with their part-of-speech category and text-type distribution; 2) a list of all lower-case word forms with their lemmas, part-of-speech categories, and text-type distribution; 3) a list of all lower-case word forms with their normalized word forms, lemmas, part-of-speech categories, and text-type distribution; 4) a list of all morphosyntactic tags and their text-type distribution (the tags are also split into several columns).
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.publisher	Jožef Stefan Institute
dc.relation.isreplacedby	http://hdl.handle.net/11356/1364
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://slovnica.ijs.si/
dc.subject	frequency list
dc.subject	spoken corpus
dc.subject	words
dc.subject	lemmas
dc.subject	normalized forms
dc.title	Frequency lists of words from the GOS 1.0 corpus
dc.type	lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType	wordList
metashare.ResourceInfo#ContentInfo.mediaType	text
hidden	hidden
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Jaka Čibej jaka.cibej@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor	Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
files.count	1
files.size	4717179

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: GOS1.0-words.zip
Size: 4.5 MB
Format: application/zip
Description: Frequency lists of words in GOS1.0
MD5: eac2a3ff4a60fc7d26625591db22bfee
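
The MD5 value above can be used to check that a downloaded copy of the archive is intact. The following is a minimal sketch, assuming the file was saved locally under the name given in this record; the helper names `md5_of_file` and `verify` are hypothetical and not part of any repository tooling.

```python
import hashlib

# Checksum copied from the "MD5:" field of this record.
EXPECTED_MD5 = "eac2a3ff4a60fc7d26625591db22bfee"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("GOS1.0-words.zip")` on a local copy should return `True` if the download matches the record; a `False` result would suggest a corrupted or different file.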

File Preview

GOS1.0-words-verbs
- GOS1.0-words-verbs-lemmas-taxonomy-entire.tsv (2 MB)
- GOS1.0-words-verbs-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (3 MB)
GOS1.0-words-prepositions
- GOS1.0-words-prepositions-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (62 kB)
- GOS1.0-words-prepositions-lemmas-taxonomy-entire.tsv (16 kB)
GOS1.0-words-interjections
- GOS1.0-words-interjections-lemmas-taxonomy-entire.tsv (12 kB)
- GOS1.0-words-interjections-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (20 kB)
GOS1.0-words-particles
- GOS1.0-words-particles-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (63 kB)
- GOS1.0-words-particles-lemmas-taxonomy-entire.tsv (16 kB)
GOS1.0-words-all
- GOS1.0-words-all-morphosyntactic_tags-split_MSD-taxonomy-entire.tsv (172 kB)
- GOS1.0-words-all-lowercase_forms-lemmas-parts_of_speech-taxonomy-entire.tsv (9 MB)
- GOS1.0-words-all-lemmas-parts_of_speech-taxonomy-entire.tsv (3 MB)
- GOS1.0-words-all-lowercase_forms-normalized_forms-lemmas-parts_of_speech-taxonomy-entire.tsv (10 MB)
GOS1.0-words-adjectives
- GOS1.0-words-adjectives-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (3 MB)
- GOS1.0-words-adjectives-lemmas-taxonomy-entire.tsv (1 MB)
GOS1.0-words-numerals
- GOS1.0-words-numerals-lemmas-taxonomy-entire.tsv (92 kB)
- GOS1.0-words-numerals-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (371 kB)
GOS1.0-words-nouns
- GOS1.0-words-nouns-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (5 MB)
- GOS1.0-words-nouns-lemmas-taxonomy-entire.tsv (3 MB)
GOS1.0-words-pronouns
- GOS1.0-words-pronouns-lemmas-taxonomy-entire.tsv (84 kB)
- GOS1.0-words-pronouns-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (587 kB)
GOS1.0-words-residual
- GOS1.0-words-residual-lemmas-taxonomy-entire.tsv (1 MB)
- GOS1.0-words-residual-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (1 MB)
GOS1.0-words-adverbs
- GOS1.0-words-adverbs-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (581 kB)
- GOS1.0-words-adverbs-lemmas-taxonomy-entire.tsv (280 kB)
GOS1.0-words-abbreviations
- GOS1.0-words-abbreviations-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (4 kB)
- GOS1.0-words-abbreviations-lemmas-taxonomy-entire.tsv (4 kB)
GOS1.0-words-conjunctions
- GOS1.0-words-conjunctions-lemmas-taxonomy-entire.tsv (12 kB)
- GOS1.0-words-conjunctions-lowercase_forms-normalized_forms-lemmas-morphosyntactic_tags-taxonomy-entire.tsv (73 kB)
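
The lists in the archive can be inspected without unpacking it. The sketch below reads one of the .tsv members listed above directly from the zip file and returns its header row together with the first few data rows. It rests on several assumptions that this record does not confirm: that the files are UTF-8 encoded, that fields are tab-separated with no quoting, and that the first row holds column names. The function name `read_tsv_member` is hypothetical. Because the record does not list the column names, the sketch returns the header as found rather than assuming any.

```python
import csv
import io
import zipfile


def read_tsv_member(archive, member_suffix, limit=5):
    """Return (header, rows) for the first archive member ending with `member_suffix`.

    `archive` may be a path or a binary file-like object. At most `limit`
    data rows are returned. UTF-8 and a header row are assumed, not confirmed.
    """
    with zipfile.ZipFile(archive) as zf:
        name = next((n for n in zf.namelist() if n.endswith(member_suffix)), None)
        if name is None:
            raise FileNotFoundError(f"no archive member ends with {member_suffix!r}")
        with zf.open(name) as raw:
            # Decode the byte stream lazily; newline="" lets csv handle line ends.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # QUOTE_NONE keeps quotation marks in word forms as literal characters.
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            header = next(reader)
            rows = [row for _, row in zip(range(limit), reader)]
    return header, rows
```

For example, `read_tsv_member("GOS1.0-words.zip", "GOS1.0-words-verbs-lemmas-taxonomy-entire.tsv")` would return the header and first five rows of the verb lemma list, provided the assumptions above hold. Matching on the file name suffix avoids depending on the exact folder layout inside the archive.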
