dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Rozman, Tadeja |
dc.contributor.author | Stritar Kučuk, Mojca |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Krapš Vodopivec, Irena |
dc.contributor.author | Stabej, Marko |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Goli, Teja |
dc.contributor.author | Lavrič, Polona |
dc.contributor.author | Laskowski, Cyprian |
dc.contributor.author | Kocjančič, Polonca |
dc.contributor.author | Klemenc, Bojan |
dc.contributor.author | Krsnik, Luka |
dc.contributor.author | Kosem, Iztok |
dc.date.accessioned | 2022-09-07T07:30:14Z |
dc.date.available | 2022-09-07T07:30:14Z |
dc.date.issued | 2022-09-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1589 |
dc.description | The Developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (age 15-19) and pupils in the 7th-9th grade of primary school (age 13-15), with a small percentage also from the 6th grade. Information on school (elementary or secondary), subject, level (grade or year), type of text, region, and date of production is provided for each text. School essays form the majority of the corpus, while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. Part of the corpus (2,094 texts) is annotated with teachers' corrections using a system of labels described in the attached document (in Slovenian). Teacher corrections were part of the original files and reflect real classroom situations of essay marking. Corrections were then inserted into the texts by annotators and subsequently categorized. The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The corpus is available in TEI format, where the original and corrected versions of the texts are encoded separately, while intertextual links with error labels give the relations between the two. The corpus is also available in the CoNLL-U and JSON formats, as well as in vertical files for use with Sketch Engine type concordancers. The previous version 2.0 was also available in two separate versions: Šolar Clear 2.0 (http://hdl.handle.net/11356/1219), containing the students' texts without teacher corrections, and Šolar Error (http://hdl.handle.net/11356/1231), containing only those sentences that have teacher corrections. Compared with version 2.0, the current version uses a different encoding, its error annotations were manually edited in approximately 350 texts, and its linguistic annotation was performed with a better tool. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.replaces | http://hdl.handle.net/11356/1214 |
dc.relation.replaces | http://hdl.handle.net/11356/1231 |
dc.relation.replaces | http://hdl.handle.net/11356/1219 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://rsdo.slovenscina.eu/jezikovni-viri |
dc.subject | developmental corpus |
dc.subject | error annotation |
dc.subject | student writing |
dc.title | Developmental corpus Šolar 3.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Iztok Kosem iztok.kosem@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | Ministry of Culture 3340-15-141006 Upgrade of Šolar Corpus nationalFunds |
sponsor | ARRS (Slovenian Research Agency) I0-0051 Centre for Applied Linguistics (CUJ) nationalFunds |
sponsor | Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other |
sponsor | University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
sponsor | ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds |
size.info | 5485 texts |
size.info | 1635407 words |
files.count | 4 |
files.size | 204057052 |
featuredService.kontext | search original|https://www.clarin.si/kontext/query?corpname=solar30_orig |
featuredService.kontext | search corrected|https://www.clarin.si/kontext/query?corpname=solar30_corr |
featuredService.noske | search original|https://www.clarin.si/ske/#dashboard?corpname=solar30_orig&struct_attr_stats=1 |
featuredService.noske | search corrected|https://www.clarin.si/ske/#dashboard?corpname=solar30_corr&struct_attr_stats=1 |
Files in this item
This item is publicly available and licensed under Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

- Name
- Solar.TEI.zip
- Size
- 94.69 MB
- Format
- application/zip
- Description
- Corpus in TEI format
- MD5
- 661f3ab8de4d3c33a5c3ea33a81f8367

- Name
- Solar.CoNLL-U.zip
- Size
- 45.41 MB
- Format
- application/zip
- Description
- Corpus in CoNLL-U + JSON format
- MD5
- ef4f3b9191395b2d4a186804af465c4a
- Solar.CoNLL-U
- solar-orig.conllu (131 MB)
- solar-errs.json (340 MB)
- solar-corr.conllu (131 MB)
- solar-meta.tsv (675 kB)
- 00README.txt (772 B)
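
The CoNLL-U files listed above can be read with a few lines of code. The following is a minimal sketch, assuming only the generic CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, and a blank line between sentences). The function name `read_conllu`, the field names in `FIELDS`, and the two-token sample are illustrative and not taken from the corpus; the sample leaves the annotation columns as `_` placeholders.

```python
# Column names follow the generic CoNLL-U layout; they are labels chosen
# for this sketch, not identifiers defined by the corpus distribution.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of lines."""
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    # Emit the last sentence if the input does not end with a blank line.
    if tokens:
        yield comments, tokens


# Invented sample for illustration only (not a sentence from the corpus).
SAMPLE = (
    "# sent_id = example.1\n"
    "1\tTo\tto\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

for comments, tokens in read_conllu(SAMPLE.splitlines()):
    print(comments, [token["form"] for token in tokens])
```

For the real data, the same generator could be fed an open file handle, for example `open("solar-orig.conllu", encoding="utf-8")`, which avoids loading the whole 131 MB file into memory.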

- Name
- Solar.vert.zip
- Size
- 53.67 MB
- Format
- application/zip
- Description
- Corpus in derived vertical format with registry files
- MD5
- 232a84e78aec63772eebd3d4e3cf730f
- Solar.vert
- solar30_orig.vert (263 MB)
- solar30_orig.regi (3 kB)
- solar30_corr.vert (263 MB)
- solar30_corr.regi (3 kB)
- 00README.txt (774 B)

- Name
- Smernice-za-oznacevanje-korpusa-Solar_V1.1.pdf
- Size
- 856.9 KB
- Format
- application/pdf
- Description
- Error annotation guidelines (in Slovenian)
- MD5
- c8b8b68fd1be51e1edadb7dd249b3ab4
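
The MD5 values listed for each file can be used to check that a download is complete. The sketch below is one way to do that with Python's standard `hashlib` module. The checksums are copied from the file list of this record; the helper names `md5_of_file` and `check_downloads`, and the assumption that the files sit in a single local directory under their original names, are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file list of this record.
EXPECTED_MD5 = {
    "Solar.TEI.zip": "661f3ab8de4d3c33a5c3ea33a81f8367",
    "Solar.CoNLL-U.zip": "ef4f3b9191395b2d4a186804af465c4a",
    "Solar.vert.zip": "232a84e78aec63772eebd3d4e3cf730f",
    "Smernice-za-oznacevanje-korpusa-Solar_V1.1.pdf":
        "c8b8b68fd1be51e1edadb7dd249b3ab4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected file name to True, False, or None if missing."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5_of_file(path) == expected if path.exists() else None
    return results


if __name__ == "__main__":
    for name, ok in check_downloads(".").items():
        status = {True: "OK", False: "MISMATCH", None: "not found"}[ok]
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use flat even for the larger archives. MD5 is adequate here for detecting a corrupted or truncated download, though it is not a defence against deliberate tampering.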