
 
dc.contributor.author Kosem, Iztok
dc.contributor.author Pori, Eva
dc.contributor.author Arhar Holdt, Špela
dc.date.accessioned 2019-03-08T13:37:07Z
dc.date.available 2019-03-08T13:37:07Z
dc.date.issued 2019-03-08
dc.identifier.uri http://hdl.handle.net/11356/1215
dc.description Wordlists, keywords and n-grams were extracted from a corpus of textbooks for Slovenian elementary and secondary schools. The corpus contains 4,302,857 words (5,373,268 tokens) and consists of 127 textbooks from 16 different subjects:
  - Biology (6 textbooks; 293,935 words),
  - State, society and ethics (1 textbook; 21,881 words),
  - Society (4 textbooks; 64,126 words),
  - Physics (5 textbooks; 185,171 words),
  - Geography (7 textbooks; 202,101 words),
  - Music (8 textbooks; 224,034 words),
  - Home Economics (3 textbooks; 33,803 words),
  - Chemistry (7 textbooks; 282,543 words),
  - Art (3 textbooks; 146,681 words),
  - Mathematics (23 textbooks; 764,012 words),
  - Science (5 textbooks; 226,191 words),
  - Science and technology (6 textbooks; 183,749 words),
  - Slovene language (37 textbooks; 1,437,945 words),
  - Environmental Education (7 textbooks; 38,645 words),
  - Technology (1 textbook; 24,733 words),
  - History (4 textbooks; 173,307 words).
  The lists were manually cleaned: most items not found in the reference morphological lexicon Sloleks (http://hdl.handle.net/11356/1039), which were mainly conversion errors, were removed. The lists include only those words, keywords or n-grams that were found in at least 8 different subjects. Keyword lists were extracted with the Sketch Engine tool; the minimum frequency was set to 5, and the statistic used was average relative frequency. The minimum frequency for n-grams was 10.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri https://creativecommons.org/licenses/by/4.0/
dc.rights.label PUB
dc.subject wordlist
dc.subject n-grams
dc.subject textbook corpus
dc.subject keywords
dc.subject vocabulary
dc.subject school
dc.title Keywords and n-grams from a textbook corpus
dc.type lexicalConceptualResource
metashare.ResourceInfo#ContentInfo.detailedType wordList
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor University of Ljubljana I0-0022 Network of research and infrastructural centres nationalFunds
size.info 5977 words
size.info 23270 keywords
size.info 9177 n-grams
size.info 7310 bigrams
size.info 1600 trigrams
size.info 184 4-grams
size.info 83 5-grams
files.count 1
files.size 885684
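The figures in this record can be cross-checked against each other. The following is a minimal Python sketch, not part of the deposited resource: it re-adds the per-subject counts from dc.description (reading the Home Economics figure as 33,803 words), the n-gram counts from size.info, and the byte count from files.size. All variable names are illustrative.

```python
# Per-subject (textbooks, words) figures copied from the dc.description field.
# Assumption: the Home Economics word count is read as 33,803.
SUBJECTS = {
    "Biology": (6, 293_935),
    "State, society and ethics": (1, 21_881),
    "Society": (4, 64_126),
    "Physics": (5, 185_171),
    "Geography": (7, 202_101),
    "Music": (8, 224_034),
    "Home Economics": (3, 33_803),
    "Chemistry": (7, 282_543),
    "Art": (3, 146_681),
    "Mathematics": (23, 764_012),
    "Science": (5, 226_191),
    "Science and technology": (6, 183_749),
    "Slovene language": (37, 1_437_945),
    "Environmental Education": (7, 38_645),
    "Technology": (1, 24_733),
    "History": (4, 173_307),
}

# n-gram list sizes copied from the size.info fields.
NGRAM_SIZES = {"bigrams": 7310, "trigrams": 1600, "4-grams": 184, "5-grams": 83}

# Archive size in bytes copied from the files.size field.
FILES_SIZE_BYTES = 885_684

total_textbooks = sum(textbooks for textbooks, _ in SUBJECTS.values())
total_words = sum(words for _, words in SUBJECTS.values())
ngram_sum = sum(NGRAM_SIZES.values())
size_kb = round(FILES_SIZE_BYTES / 1024, 2)  # assumes 1 KB = 1024 bytes

print("subjects:", len(SUBJECTS))
print("textbooks:", total_textbooks)
print("words:", total_words)
print("n-grams:", ngram_sum)
print("archive size (KB):", size_kb)
```

Under these readings the per-subject counts add up to the stated totals of 127 textbooks and 4,302,857 words, the four n-gram lists add up to 9177, and 885684 bytes correspond to 864.93 KB.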


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Distributed under Creative Commons Attribution Required
Name: Wordlists-Keywords-n-grams_textbook.zip
Size: 864.93 KB
Format: application/zip
Description: TSV files + README
MD5: 2a402be9b0bc1cba560eb55e24a3cdb4
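The MD5 value above can be used to check a downloaded copy of the archive. The following is a minimal Python sketch using only the standard library; the function names and the local file path are illustrative assumptions, not something the repository provides.

```python
import hashlib
from pathlib import Path

# Checksum copied from the MD5 field of this record.
EXPECTED_MD5 = "2a402be9b0bc1cba560eb55e24a3cdb4"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location: the archive saved in the current directory.
    archive = Path("Wordlists-Keywords-n-grams_textbook.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting a corrupted or truncated download; it is not a safeguard against deliberate tampering.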
 File Preview  
  • Wordlists-Keywords-n-grams_textbook
    • keywords-Physics.txt (34 kB)
    • keywords-Home_Economics.txt (11 kB)
    • keywords-Science_Technology.txt (45 kB)
    • keywords-Technology.txt (7 kB)
    • keywords-Art.txt (42 kB)
    • keywords-Geography.txt (60 kB)
    • keywords-Biology.txt (61 kB)
    • keywords-Chemistry.txt (39 kB)
    • keywords-Science.txt (51 kB)
    • keywords-Slovene_Language.txt (169 kB)
    • keywords-Environmental_Education.txt (13 kB)
    • keywords-Society.txt (21 kB)
    • keywords-State_Society_Ethics.txt (10 kB)
    • keywords-Mathematics.txt (43 kB)
    • README.txt (2 kB)
    • keywords-History.txt (45 kB)
    • keywords-Music.txt (47 kB)
    • Wordlist-general.txt (854 kB)
    • Wordlist-by-level.txt (779 kB)
    • n-grams.txt (840 kB)
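
Since the archive is described as containing TSV files, the per-subject keyword lists can be inspected without unpacking it. The following is a minimal Python sketch under stated assumptions: the files are assumed to be UTF-8 encoded and tab-separated, the column layout is not given in this record (README.txt in the archive should be consulted), and the function name preview_tsv_members is hypothetical.

```python
import csv
import io
import zipfile


def preview_tsv_members(archive_path, prefix="keywords-", max_rows=3):
    """Return {file name: first rows} for matching .txt members of the archive.

    Assumptions: members are UTF-8 text with tab-separated columns.
    """
    previews = {}
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            # Members may sit inside a top-level folder; compare the base name.
            base = info.filename.rsplit("/", 1)[-1]
            if info.is_dir() or not base.startswith(prefix) or not base.endswith(".txt"):
                continue
            with archive.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # QUOTE_NONE keeps quotation marks in the data as literal characters.
                reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
                previews[base] = [row for _, row in zip(range(max_rows), reader)]
    return previews
```

Calling preview_tsv_members("Wordlists-Keywords-n-grams_textbook.zip") would return the first three rows of each keywords-*.txt file; passing prefix="n-grams" or prefix="Wordlist-" selects the other lists instead.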
