The reference list of the most frequent common Slovene words was compiled by taking the intersection of the 10,000 most frequent lemmas in each of four Slovene text corpora: Kres (the balanced reference corpus of written Slovene), GOS (the reference corpus of spoken Slovene), Janes (the corpus of computer-mediated communication), and Šolar 2.0 (the corpus of school written production). The list was then manually cleaned and contains 4,768 common general lemmas. The file is in tab-separated format; each row contains the lemma, its part of speech (following the MULTEXT-East tagset for Slovene), its relative average reduced frequency in each of the four corpora, and the final average score computed from these values.
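The column layout described above could be read along the following lines. This is only a minimal sketch under several assumptions: the column names, the order of the four corpus columns (taken here to follow the order in which the corpora are listed), the absence of a header row, and the use of a dot as the decimal separator are not confirmed by the description, and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical column names; the description gives only the kinds of columns,
# and the corpus order here is an assumption.
COLUMNS = ["lemma", "pos", "kres", "gos", "janes", "solar", "score"]


def read_word_list(handle, has_header=False):
    """Parse tab-separated rows into dicts, converting frequencies to float."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # discard the header row if the file has one
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        for name in COLUMNS[2:]:
            row[name] = float(row[name])
        rows.append(row)
    return rows


# Made-up sample values for illustration only, not taken from the dataset.
sample = (
    "biti\tVa\t10.0\t20.0\t30.0\t40.0\t25.0\n"
    "in\tCc\t8.0\t6.0\t4.0\t2.0\t5.0\n"
)
rows = read_word_list(io.StringIO(sample))

# Rank lemmas by the final average score, highest first.
top = sorted(rows, key=lambda r: r["score"], reverse=True)
print([r["lemma"] for r in top])
```

To read the actual file, the `io.StringIO` object would be replaced by a file opened with `encoding="utf-8"` and `newline=""`, and `has_header` set according to what the file contains.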
The dataset is described in more detail in: Špela Arhar Holdt, Senja Pollak, Marko Robnik Šikonja, Simon Krek (2020). Referenčni seznam pogostih splošnih besed za slovenščino. In: Proceedings of the Conference on Language Technologies and Digital Humanities, pp. 10-15.
Funding:
ARRS (Slovenian Research Agency), P6-0411, "Language Resources and Technologies for Slovene"
ARRS (Slovenian Research Agency), P2-103, "Knowledge Technologies"
European Union, EC/H2020/825153, "EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media"
Ministry of Education, Science and Sport, 3330-17-1748, "KAUČ - Improving the Quality of Slovene Textbooks/Za kakovost slovenskih učbenikov"