dc.contributor.author | Dobrovoljc, Kaja |
dc.date.accessioned | 2018-08-03T18:46:37Z |
dc.date.available | 2018-08-03T18:46:37Z |
dc.date.issued | 2018-08-03 |
dc.identifier.uri | http://hdl.handle.net/11356/1195 |
dc.description | A collection of n-grams extracted from the Gos corpus of spoken Slovene (cf. http://eng.slovenscina.eu/korpusi/gos). Three sets of n-gram lists are provided for lowercased word n-grams of length 1 to 5: (1) extensive frequency lists of all extracted n-grams; (2) filtered frequency lists of n-grams with a minimum frequency of 10 per million; (3) an adjusted frequency list of all n-grams with a minimum frequency of 10 per million. Only n-grams within sentences have been counted, ignoring punctuation. For the filtered and adjusted lists, only n-grams occurring in at least 2 different texts have been extracted. Key references: (a) K. Dobrovoljc, 2018. N-gram frequency lists for reference corpora of Slovenian language. Proceedings of the Language Technologies & Digital Humanities Conference 2018. (b) D. Verdonik, I. Kosem, A. Zwitter Vitez, S. Krek, M. Stabej, 2013. Compilation, transcription and usage of a reference speech corpus: The case of the Slovene corpus GOS. Language Resources and Evaluation, 47 (4), pp. 1031-1048, doi: 10.1007/s10579-013-9216-5. (c) M. B. O’Donnell, 2010. The adjusted frequency list: A method to produce cluster-sensitive frequency lists. ICAME Journal 35, pp. 135–169. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreferencedby | http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Dobrovoljc-K_Frekvencni-seznami-n-gramov-v-korpusih-slovenskega-jezika.pdf |
dc.relation.replaces | http://hdl.handle.net/11356/1046 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | n-grams |
dc.subject | wordlist |
dc.subject | multiword expressions |
dc.subject | spoken corpus |
dc.title | Gos corpus n-grams 2.0 |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Kaja Dobrovoljc; kaja.dobrovoljc@ijs.si; Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency); MR-36491; Young Researcher Programme; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); J6-8256; New grammar of contemporary standard Slovene: sources and methods; nationalFunds |
size.info | 62710 unigrams |
size.info | 394416 bigrams |
size.info | 692260 trigrams |
size.info | 750559 4-grams |
size.info | 698208 5-grams |
size.info | 2598153 n-grams |
files.count | 3 |
files.size | 22035855 bytes (21.02 MB) |
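
The extraction criteria stated in dc.description (lowercased word n-grams of length 1 to 5, counted within sentences only, punctuation ignored, a minimum frequency of 10 per million, and occurrence in at least 2 texts) can be illustrated with the minimal sketch below. It is not the script used to build these lists. The regex tokenizer, the function names, the input layout (a mapping from text ID to a list of sentences), and the choice to normalize by the total number of word tokens are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: a simple regex tokenizer that keeps word characters only,
# so punctuation never becomes part of an n-gram.
TOKEN_RE = re.compile(r"\w+")


def sentence_ngrams(sentence, n):
    """Return lowercased word n-grams of length n from a single sentence."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, max_n=5):
    """Count n-grams of length 1..max_n.

    texts: hypothetical layout, mapping text_id -> list of sentence strings.
    N-grams are built per sentence, so none spans a sentence boundary.
    """
    freq = Counter()
    seen_in = defaultdict(set)  # n-gram -> IDs of texts it occurs in
    total_tokens = 0
    for text_id, sentences in texts.items():
        for sentence in sentences:
            total_tokens += len(TOKEN_RE.findall(sentence))
            for n in range(1, max_n + 1):
                for gram in sentence_ngrams(sentence, n):
                    freq[gram] += 1
                    seen_in[gram].add(text_id)
    return freq, seen_in, total_tokens


def filter_ngrams(freq, seen_in, total_tokens,
                  min_per_million=10.0, min_texts=2):
    """Keep n-grams meeting both thresholds.

    Assumption: relative frequency is computed per million word tokens.
    """
    kept = {}
    for gram, count in freq.items():
        per_million = count * 1_000_000 / total_tokens
        if per_million >= min_per_million and len(seen_in[gram]) >= min_texts:
            kept[gram] = count
    return kept
```

Calling `count_ngrams` on a small mapping such as `{"t1": ["Ja, to je to.", "To je res."], "t2": ["To je dobro!"]}` and passing the results to `filter_ngrams` keeps only the n-grams attested in both texts. The adjusted frequency list (O'Donnell 2010) involves an additional step that this sketch does not attempt.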
Files in this item

Name | all_1-5-grams_Gos.zip |
Size | 20.83 MB |
Format | application/zip |
Description | Collection of all n-grams. |
MD5 | 48e9597d1a39f29e2c65f38ef236c41a |

Name | filtered_1-5-grams_Gos.zip |
Size | 101.68 KB |
Format | application/zip |
Description | Collection of n-grams above frequency 10/mil. |
MD5 | b000a8afda2b5181d6c6bfbc0ff55206 |

Name | adjusted_1-5-grams_Gos.zip |
Size | 91.73 KB |
Format | application/zip |
Description | List of all n-grams with adjusted frequency above 10/mil. |
MD5 | 2c5fe866fde6ff1cbc3cb76255d93039 |
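
The MD5 values listed for the three archives can be used to check a download for corruption. The sketch below uses Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are hypothetical, and the local path passed in is assumed to point at a file already downloaded from this record.

```python
import hashlib

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "all_1-5-grams_Gos.zip": "48e9597d1a39f29e2c65f38ef236c41a",
    "filtered_1-5-grams_Gos.zip": "b000a8afda2b5181d6c6bfbc0ff55206",
    "adjusted_1-5-grams_Gos.zip": "2c5fe866fde6ff1cbc3cb76255d93039",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("filtered_1-5-grams_Gos.zip", "filtered_1-5-grams_Gos.zip")` returns `True` only if the local copy is byte-identical to the published archive. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.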