dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2023-03-01T12:10:03Z |
dc.date.available | 2023-03-01T12:10:03Z |
dc.date.issued | 2023-02-28 |
dc.identifier.uri | http://hdl.handle.net/11356/1719 |
dc.description | The dataset contains a list of 11906 words (lemmas with part of speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grade 1 to 9) and secondary school (Year 1 to 4). The corpus contains 4,302,857 words (5,373,268 tokens), and consists of 127 textbooks from 16 different subjects. The distribution per school level is as follows: - Grade 1: 17949 tokens - Grade 2: 46317 tokens - Grade 3: 84222 tokens - Grade 4: 305454 tokens - Grade 5: 357400 tokens - Grade 6: 351463 tokens - Grade 7: 537359 tokens - Grade 8: 592068 tokens - Grade 9: 765574 tokens - Year 1: 665093 tokens - Year 2: 200267 tokens - Year 3: 149442 tokens - Year 4: 23406 tokens - Year 1-4: 206843 tokens (these are textbooks that are used in all the years of secondary school and were not divided according to different years) The purpose of the dataset is to facilitate research into vocabulary use at different levels of education, and to enable comparative studies of student language reception and production in Slovene. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/en/ |
dc.subject | textbook corpus |
dc.subject | vocabulary |
dc.subject | diachronic |
dc.subject | school |
dc.subject | language didactics |
dc.title | Frequency list of textbook vocabulary by level of education in elementary and secondary schools |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Iztok Kosem iztok.kosem@ff.uni-lj.si Faculty of Arts, University of Ljubljana |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 11906 words |
files.count | 2 |
files.size | 1447177 |
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

- Name
- frequency-list-from-textbook-corpus-diachronic.txt
- Size
- 1.38 MB
- Format
- Text file
- Description
- Frequency list in text format
- MD5
- 9fac168c226aee97d6e1e0251f528c5a
Lema Lema (male črke) Besedna vrsta Skupna absolutna pogostost leme Skupna relativna pogostost (na milijon pojavitev) Absolutna pogostost (1. razred) Relativna pogostost (1. razred) Absolutna pogostost (2. razred) Relativna pogostost (2. razred) Absolutna pogostost (3. razred) Relativna pogostost (3. razred) Absolutna pogostost (4. razred) Relativna pogostost (4. razred) Absolutna pogostost (5. razred) Relativna pogostost (5. razred) Absolutna pogostost (6. razred) Relativna pogostost (6. razred) Absolutna pogostost (7. razred) Relativna pogostost (7. razred) Absolutna pogostost (8. razred) Relativna pogostost (8. razred) Absolutna pogostost (9. razred) Relativna pogostost (9. razred) Absolutna pogostost (1. letnik) Relativna pogostost (1. letnik) Absolutna pogostost (2. letnik) Relativna pogostost (2. letnik) Absolutna pogostost (3. letnik) Relativna pogostost (3. letnik) Absolutna pogostost (4. letnik) Relativna pogostost (4. letnik) Absolutna pogostost (1.-4. letnik) Relativna pogos . . .
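
The header above pairs each absolute frequency column ("Absolutna pogostost") with a relative one ("Relativna pogostost"), and the overall relative column is stated per million occurrences ("na milijon pojavitev"). A minimal Python sketch of that normalization follows, using the per-level sizes quoted in the dataset description. It is an assumption, not something the record confirms, that the published relative values use exactly these figures as denominators. The function name `per_million` and the example count of 25 are hypothetical. The per-level figures sum to 4,302,857, which matches the corpus word count given in the description.

```python
# Per-level sizes copied from the dataset description (dc.description).
LEVEL_SIZES = {
    "Grade 1": 17949,
    "Grade 2": 46317,
    "Grade 3": 84222,
    "Grade 4": 305454,
    "Grade 5": 357400,
    "Grade 6": 351463,
    "Grade 7": 537359,
    "Grade 8": 592068,
    "Grade 9": 765574,
    "Year 1": 665093,
    "Year 2": 200267,
    "Year 3": 149442,
    "Year 4": 23406,
    "Year 1-4": 206843,
}


def per_million(absolute_count: int, level_size: int) -> float:
    """Scale an absolute count to occurrences per million (hypothetical helper)."""
    if level_size <= 0:
        raise ValueError("level_size must be positive")
    return absolute_count / level_size * 1_000_000


# Sum of all per-level sizes.
total = sum(LEVEL_SIZES.values())
print(total)  # 4302857

# Hypothetical lemma seen 25 times in Grade 1 textbooks (illustrative value only).
print(round(per_million(25, LEVEL_SIZES["Grade 1"]), 2))
```

Because the levels differ in size by more than a factor of 40 (Grade 1 versus Grade 9), comparing lemmas across levels is only meaningful on the normalized values, not on the absolute counts.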

- Name
- README.txt
- Size
- 1.84 KB
- Format
- Text file
- Description
- README file
- MD5
- 8df9248e571f9ce83993f73ca8175d94
***************
SLO: Podatkovni niz vsebuje seznam 11.906 besed (s podatkom o besedni vrsti) in njihove pogostosti v učbeniškem korpusu, ki vsebuje učbenike iz osnovne šole (od 1. do 9. razreda) in srednje šole (od 1. do 4. letnika).
ENG: The dataset contains a list of 11906 words (lemmas with part of speech information) and their frequency of occurrence in a corpus of Slovenian textbooks, covering elementary school (Grade 1 to 9) and secondary school (Year 1 to 4).
Kosem, Iztok; Pori, Eva; Arhar Holdt, Špela, 2023, Frequency list of textbook vocabulary by level of education in elementary and secondary schools, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1719.
***************
"Lema":
SLO: Lema besede iz učbeniškega korpusa.
ENG: Lemma of the word from the textbook corpus.
"Lema (male črke)":
SLO: Lema besede z malimi črkami.
ENG: Lemma of the word in lower case.
"Besedna vrsta":
SLO: Podatek o besedni vrsti besede (po . . .
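
The file listing above gives an MD5 checksum for each of the two files. A downloaded copy could be checked against those values with a short Python sketch like the following. The function names `md5_of` and `verify` are hypothetical, and the sketch assumes both files sit in one local directory under the names shown in the listing.

```python
import hashlib
from pathlib import Path

# File names and MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "frequency-list-from-textbook-corpus-diachronic.txt": "9fac168c226aee97d6e1e0251f528c5a",
    "README.txt": "8df9248e571f9ce83993f73ca8175d94",
}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Path) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = directory / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify(Path("."))` in the download directory returns a dictionary with one entry per file. A `False` value means the file is missing or differs from the copy described in this record.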