dc.contributor.author | Kosem, Iztok |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Žagar, Aleš |
dc.contributor.author | Arhar Holdt, Špela |
dc.date.accessioned | 2022-10-31T09:32:57Z |
dc.date.available | 2022-10-31T09:32:57Z |
dc.date.issued | 2022-10-13 |
dc.identifier.uri | http://hdl.handle.net/11356/1693 |
dc.description | ccUčbeniki includes 32 openly available textbooks for Slovenian primary and secondary education, published by the Slovenian National Education Institute in 2014-2015. The textbooks, prepared by various authors, cover different subjects, as documented in the ccucbeniki-metadata file. The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The aim is to provide comparably annotated, pedagogically relevant corpora that can be used for different tasks in the fields of language didactics and NLP. The corpus is available in CoNLL-U and vertical formats. The CoNLL-U format contains one document per file (with the text metadata in a separate TSV file), and the vertical format contains all documents concatenated in one large file. The registry file ccucbeniki.regi for the vertical format is compatible with the LIST 1.2 corpus extraction tool (http://hdl.handle.net/11356/1276). |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://www.cjvt.si/prop/ |
dc.subject | textbook corpus |
dc.subject | student reading |
dc.subject | language didactics |
dc.title | Corpus of Slovenian textbooks ccUčbeniki 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Špela Arhar Holdt Spela.ArharHoldt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | University of Ljubljana I0-0022 Network of research and infrastructural centres nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 32 texts |
size.info | 2181602 tokens |
files.count | 2 |
files.size | 47062549 |
Files in this item
Download all files in this item (44.88 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: ccucbeniki.conllu.zip
- Size: 21.6 MB
- Format: application/zip
- Description: Corpus in CoNLL-U format with TSV metadata
- MD5: d3eda8559a3d584074ec62c584eb6aa0
- ccucbeniki.conllu
  - ucb6.conllu (4 MB)
  - ucb1.conllu (4 MB)
  - ucb32.conllu (2 MB)
  - ucb30.conllu (3 MB)
  - ucb26.conllu (19 MB)
  - ucb21.conllu (1 MB)
  - ucb29.conllu (2 MB)
  - ucb24.conllu (5 MB)
  - ucb15.conllu (2 MB)
  - ucb10.conllu (5 MB)
  - ucb18.conllu (13 MB)
  - ucb13.conllu (3 MB)
  - ucb9.conllu (2 MB)
  - ucb4.conllu (11 MB)
  - ucb7.conllu (3 MB)
  - ucb2.conllu (3 MB)
  - ucb5.conllu (2 MB)
  - ucb31.conllu (4 MB)
  - ucb27.conllu (3 MB)
  - ucb22.conllu (3 MB)
  - ucb25.conllu (4 MB)
  - ucb20.conllu (3 MB)
  - ucb16.conllu (10 MB)
  - ucb11.conllu (4 MB)
  - ucb28.conllu (5 MB)
  - ucb19.conllu (5 MB)
  - ucb14.conllu (2 MB)
  - ucb23.conllu (5 MB)
  - ucb17.conllu (4 MB)
  - ucb12.conllu (2 MB)
  - ucb8.conllu (2 MB)
  - ucb3.conllu (8 MB)
  - ccucbeniki-metadata.tsv (5 kB)
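
The per-document CoNLL-U files listed above can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the corpus distribution: it assumes the standard ten tab-separated CoNLL-U columns, and the names `read_conllu`, `lemma_counts`, and `SAMPLE` are hypothetical. The sample sentences are invented for illustration and are not taken from the corpus; columns that this sketch does not use are left as `_`.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (sentence ids, raw text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Ignore malformed lines rather than guessing their layout.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_counts(sentences: Iterable[List[Dict[str, str]]]) -> Counter:
    """Count lemmas, skipping multiword ranges (1-2) and empty nodes (1.1)."""
    counts: Counter = Counter()
    for sent in sentences:
        for tok in sent:
            if tok["id"].isdigit():
                counts[tok["lemma"]] += 1
    return counts


# Invented two-sentence sample; unused columns are "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "1\tTo\tta\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tknjiga\tknjiga\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = example.2\n"
    "1\tKnjiga\tknjiga\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tnova\tnov\t_\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\t_\t_\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    sentences = list(read_conllu(SAMPLE.splitlines()))
    print(len(sentences), "sentences")
    print(lemma_counts(sentences).most_common(3))
```

To process a real file, pass an open file handle instead of the sample, for example `read_conllu(open("ucb1.conllu", encoding="utf-8"))` after unpacking the archive.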

- Name: ccucbeniki.vert.zip
- Size: 23.28 MB
- Format: application/zip
- Description: Corpus in vertical format with LIST-type registry file
- MD5: bd37fd936ad8cd093eadd8e2c9be372d
- ccucbeniki.vert
  - ccucbeniki.vert (186 MB)
  - README.txt (1 kB)
  - ccucbeniki.regi (135 B)