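dc.contributor.author Pollak, Senja
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Krek, Simon
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2020-09-24T15:32:42Z
dc.date.available 2020-09-24T15:32:42Z
dc.date.issued 2020-09-10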
dc.description The reference list of the most frequent common Slovene words was prepared by selecting the vocabulary at the intersection of the 10,000 most frequent lemmas in each of four Slovene text corpora: the balanced reference corpus of written Slovene Kres, the reference corpus of spoken Slovene GOS, the corpus of computer-mediated communication Janes, and the corpus of school written production Šolar 2.0. The list was then manually cleaned and contains 4,768 common general lemmas. The file is in tab-separated format and contains the lemma, its part of speech (following the MULTEXT-East tagset for Slovene), the relative average reduced frequency in each of the corpora, and the final average score computed from these values. The dataset is described in more detail in: Špela Arhar Holdt, Senja Pollak, Marko Robnik Šikonja, Simon Krek (2020). Referenčni seznam pogostih splošnih besed za slovenščino. In: Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 10-15.
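The selection step and the file layout described above can be sketched in Python. This is only an illustration under stated assumptions: the column order (lemma, part of speech, four per-corpus frequencies, final score) is inferred from the description, the column names are made up for this example, and whether the file has a header row is not known, so the reader simply skips any line whose numeric fields do not parse. The sample row uses invented values, not data from the list.

```python
import csv
import io
from typing import Dict, Iterable, List, Set

# Assumed column order, inferred from the description; names are illustrative.
CORPORA = ["Kres", "GOS", "Janes", "Solar"]
COLUMNS = ["lemma", "pos"] + CORPORA + ["score"]


def common_lemmas(top_lemmas: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the lemmas that occur in every corpus's top-N list."""
    sets = [set(lemmas) for lemmas in top_lemmas.values()]
    return set.intersection(*sets) if sets else set()


def read_word_list(handle) -> List[dict]:
    """Parse tab-separated rows into dicts, converting frequencies to float."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        try:
            for name in CORPORA + ["score"]:
                row[name] = float(row[name])
        except ValueError:
            continue  # skip a header line, if the file has one
        rows.append(row)
    return rows


# Made-up sample row; the tag and numbers are placeholders, not real data.
sample = "hiša\tNc\t1.0\t2.0\t3.0\t4.0\t2.5\n"
rows = read_word_list(io.StringIO(sample))
```

To read the downloaded file instead of the sample, one would pass an open file handle, for example `open(path, encoding="utf-8", newline="")`, where `path` points at the local copy.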
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.label PUB
dc.subject common words
dc.subject frequent words
dc.subject reference corpora
dc.subject readability
dc.title Reference List of Slovene Frequent Common Words
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Špela Arhar Holdt Centre for Language Resources and Technologies, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor Ministry of Education, Science and Sport 3330-17-1748 KAUČ - Improving the Quality of Slovene Textbooks/Za kakovost slovenskih učbenikov nationalFunds
size.info 4768 entries
files.count 1
files.size 71183

 Files in this item

This item is publicly available and licensed under Creative Commons - Attribution 4.0 International (CC BY 4.0).
Slovene frequent common words in TSV format (69.51 KB)
File preview:
    • SloveneFrequentCommonWords.txt (480 kB)
