dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2019-03-08T13:37:07Z |
dc.date.available | 2019-03-08T13:37:07Z |
dc.date.issued | 2019-03-08 |
dc.identifier.uri | http://hdl.handle.net/11356/1215 |
dc.description | Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects: - Biology (6 textbooks; 293,935 words), - State, society and ethics (1 textbook; 21,881 words), - Society (4 textbooks; 64,126 words), - Physics (5 textbooks; 185,171 words), - Geography (7 textbooks; 202,101 words), - Music (8 textbooks; 224,034 words), - Home Economics (3 textbooks; 33,803 words), - Chemistry (7 textbooks; 282,543 words), - Art (3 textbooks; 146,681 words), - Mathematics (23 textbooks; 764,012 words), - Science (5 textbooks; 226,191 words), - Science and technology (6 textbooks; 183,749 words), - Slovene language (37 textbooks; 1,437,945 words), - Environmental Education (7 textbooks; 38,645 words), - Technology (1 textbook; 24,733 words), - History (4 textbooks; 173,307 words). The lists were manually cleaned: most items not found in the reference morphological lexicon Sloleks (http://hdl.handle.net/11356/1039), which were mainly conversion errors, were removed. The lists include only those words, keywords or n-grams that were found in at least 8 different subjects. Keyword lists were extracted using the Sketch Engine tool, with the minimum frequency set to 5; the statistic used was average relative frequency. The minimum frequency for n-grams was 10. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ |
dc.rights.label | PUB |
dc.subject | wordlist |
dc.subject | n-grams |
dc.subject | textbook corpus |
dc.subject | keywords |
dc.subject | vocabulary |
dc.subject | school |
dc.title | Keywords and n-grams from a textbook corpus |
dc.type | lexicalConceptualResource |
metashare.ResourceInfo#ContentInfo.detailedType | wordList |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Iztok Kosem iztok.kosem@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | University of Ljubljana I0-0022 Network of research and infrastructural centres nationalFunds |
size.info | 5977 words |
size.info | 23270 keywords |
size.info | 9177 n-grams |
size.info | 7310 bigrams |
size.info | 1600 trigrams |
size.info | 184 4-grams |
size.info | 83 5-grams |
files.count | 1 |
files.size | 885684 |
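The description states that only items found in at least 8 different subjects were kept. The following is a minimal sketch of such a coverage filter, not the authors' actual procedure: the function name `filter_by_subject_coverage` and the input mapping `{subject: {item: frequency}}` are assumptions made for illustration, and the real lists were additionally cleaned by hand.

```python
from collections import defaultdict

MIN_SUBJECTS = 8  # threshold stated in the record's description


def filter_by_subject_coverage(counts_by_subject, min_subjects=MIN_SUBJECTS):
    """Return the items that occur in at least `min_subjects` subjects.

    `counts_by_subject` is a hypothetical mapping {subject: {item: frequency}};
    an item counts as occurring in a subject if its frequency there is > 0.
    """
    subjects_per_item = defaultdict(set)
    for subject, counts in counts_by_subject.items():
        for item, frequency in counts.items():
            if frequency > 0:
                subjects_per_item[item].add(subject)
    return {
        item
        for item, subjects in subjects_per_item.items()
        if len(subjects) >= min_subjects
    }
```

An item attested in 8 of the 16 subjects passes the filter, while one attested in only 7 is dropped.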
Files in this item
- Name: Wordlists-Keywords-n-grams_textbook.zip
- Size: 864.93 KB
- Format: application/zip
- Description: TSV files + README
- MD5: 2a402be9b0bc1cba560eb55e24a3cdb4
- Wordlists-Keywords-n-grams_textbook
  - keywords-Physics.txt (34 kB)
  - keywords-Home_Economics.txt (11 kB)
  - keywords-Science_Technology.txt (45 kB)
  - keywords-Technology.txt (7 kB)
  - keywords-Art.txt (42 kB)
  - keywords-Geography.txt (60 kB)
  - keywords-Biology.txt (61 kB)
  - keywords-Chemistry.txt (39 kB)
  - keywords-Science.txt (51 kB)
  - keywords-Slovene_Language.txt (169 kB)
  - keywords-Environmental_Education.txt (13 kB)
  - keywords-Society.txt (21 kB)
  - keywords-State_Society_Ethics.txt (10 kB)
  - keywords-Mathematics.txt (43 kB)
  - README.txt (2 kB)
  - keywords-History.txt (45 kB)
  - keywords-Music.txt (47 kB)
  - Wordlist-general.txt (854 kB)
  - Wordlist-by-level.txt (779 kB)
  - n-grams.txt (840 kB)
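
The record lists an MD5 checksum for the archive, so a downloaded copy can be checked against it. The sketch below uses Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` and the local file path are assumptions for illustration, and only the expected digest comes from the record.

```python
import hashlib

# MD5 digest as listed in this record for the zip archive.
EXPECTED_MD5 = "2a402be9b0bc1cba560eb55e24a3cdb4"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_download("Wordlists-Keywords-n-grams_textbook.zip")` would return `True` for an intact copy saved under that (assumed) local name.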