dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Rozman, Tadeja
dc.contributor.author Stritar Kučuk, Mojca
dc.contributor.author Krek, Simon
dc.contributor.author Krapš Vodopivec, Irena
dc.contributor.author Stabej, Marko
dc.contributor.author Pori, Eva
dc.contributor.author Goli, Teja
dc.contributor.author Lavrič, Polona
dc.contributor.author Laskowski, Cyprian
dc.contributor.author Kocjančič, Polonca
dc.contributor.author Klemenc, Bojan
dc.contributor.author Krsnik, Luka
dc.contributor.author Kosem, Iztok
dc.date.accessioned 2022-09-07T07:30:14Z
dc.date.available 2022-09-07T07:30:14Z
dc.date.issued 2022-09-05
dc.identifier.uri http://hdl.handle.net/11356/1589
dc.description The developmental corpus Šolar consists of 5,485 texts written by students in Slovenian secondary schools (ages 15-19) and pupils in the 7th-9th grades of primary school (ages 13-15), with a small percentage also from the 6th grade. Information on school (elementary or secondary), subject, level (grade or year), type of text, region, and date of production is provided for each text. School essays form the majority of the corpus, while other material includes texts created during lessons, such as text recapitulations or descriptions, examples of formal applications, etc. Part of the corpus (2,094 texts) is annotated with teachers' corrections using a system of labels described in the attached document (in Slovenian). The teachers' corrections were part of the original files and reflect real classroom situations of essay marking. Annotators then inserted the corrections into the texts and subsequently categorized them. The corpus was linguistically annotated with the CLASSLA v1.1.1 pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The corpus is available in TEI format, where the original and corrected versions of the texts are encoded separately, while intertextual links with error labels give the relations between the two. The corpus is also available in the CoNLL-U and JSON formats, as well as in vertical files for use with Sketch Engine type concordancers. Unlike the previous version 2.0, which was also available in two separate versions, i.e. Šolar Clear 2.0 (http://hdl.handle.net/11356/1219), with the students' texts without teacher corrections, and Šolar Error (http://hdl.handle.net/11356/1231), with only those sentences that have teacher corrections, the current version uses a different encoding, has error annotations that were manually edited in approximately 350 texts, and was linguistically annotated with a better tool (the CLASSLA pipeline mentioned above).
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.replaces http://hdl.handle.net/11356/1214
dc.relation.replaces http://hdl.handle.net/11356/1231
dc.relation.replaces http://hdl.handle.net/11356/1219
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri https://rsdo.slovenscina.eu/jezikovni-viri
dc.subject developmental corpus
dc.subject error annotation
dc.subject student writing
dc.title Developmental corpus Šolar 3.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Iztok Kosem iztok.kosem@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor Ministry of Culture 3340-15-141006 Upgrade of Šolar Corpus nationalFunds
sponsor ARRS (Slovenian Research Agency) I0-0051 Centre for Applied Linguistics (CUJ) nationalFunds
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor University of Ljubljana I0-0022 Network of Research Infrastructure Centres (MRIC) nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
sponsor ARRS J7-3159 Empirical foundations for digitally-supported development of writing skills nationalFunds
size.info 5485 texts
size.info 1635407 words
files.count 4
files.size 204057052
featuredService.kontext search original|https://www.clarin.si/kontext/query?corpname=solar30_orig
featuredService.kontext search corrected|https://www.clarin.si/kontext/query?corpname=solar30_corr
featuredService.noske search original|https://www.clarin.si/ske/#dashboard?corpname=solar30_orig&struct_attr_stats=1
featuredService.noske search corrected|https://www.clarin.si/ske/#dashboard?corpname=solar30_corr&struct_attr_stats=1
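
The CoNLL-U files mentioned in the description can be read with standard-library Python. The following is a minimal sketch, assuming the files follow the standard ten-column, tab-separated CoNLL-U layout with `#` comment lines and blank lines between sentences. The function name `read_conllu` and the sample sentence (including its tags and dependency labels) are invented for illustration and are not taken from the corpus or its README.

```python
import io

# Column names of the standard CoNLL-U format, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(stream):
    """Yield (comments, tokens) for each sentence in a CoNLL-U text stream.

    comments is a list of comment strings without the leading '#';
    tokens is a list of dicts keyed by CONLLU_FIELDS.
    """
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        yield comments, tokens


# Invented sample sentence; not taken from the corpus.
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je primer.\n"
    "1\tTo\tta\tDET\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tprimer\tprimer\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for comments, tokens in sentences:
    print(comments, [tok["form"] for tok in tokens])
```

For the real data, the same generator could be fed an open file handle, for example `open("solar-orig.conllu", encoding="utf-8")`. Because it yields one sentence at a time, the 131 MB files do not need to be loaded into memory at once.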


 Files in this item

 Download all files in item (194.6 MB)
Name: Solar.TEI.zip
Size: 94.69 MB
Format: application/zip
Description: Corpus in TEI format
MD5: 661f3ab8de4d3c33a5c3ea33a81f8367
File preview:
  • Solar.TEI
    • solar.xml (51 kB)
    • solar-corr.xml (523 MB)
    • solar-orig.xml (522 MB)
    • solar-errs.xml (143 MB)
    • schema
      • tei_clarin_example.xml (48 kB)
      • tei_clarin.rnc (311 kB)
      • tei_clarin_schema.xml (70 kB)
      • README.md (525 B)
      • tei_clarin.rng (654 kB)
    • 00README.txt (971 B)

Name: Solar.CoNLL-U.zip
Size: 45.41 MB
Format: application/zip
Description: Corpus in CoNLL-U + JSON format
MD5: ef4f3b9191395b2d4a186804af465c4a
File preview:
  • Solar.CoNLL-U
    • solar-orig.conllu (131 MB)
    • solar-errs.json (340 MB)
    • solar-corr.conllu (131 MB)
    • solar-meta.tsv (675 kB)
    • 00README.txt (772 B)

Name: Solar.vert.zip
Size: 53.67 MB
Format: application/zip
Description: Corpus in derived vertical format with registry files
MD5: 232a84e78aec63772eebd3d4e3cf730f
File preview:
  • Solar.vert
    • solar30_orig.vert (263 MB)
    • solar30_orig.regi (3 kB)
    • solar30_corr.vert (263 MB)
    • solar30_corr.regi (3 kB)
    • 00README.txt (774 B)

Name: Smernice-za-oznacevanje-korpusa-Solar_V1.1.pdf
Size: 856.9 KB
Format: PDF
Description: Error annotation guidelines (in Slovenian)
MD5: c8b8b68fd1be51e1edadb7dd249b3ab4
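
The MD5 values listed for the files can be used to check that a download is intact. The following is a minimal sketch using Python's `hashlib`, assuming the files were saved under their listed names in a single directory. The helper names `md5_of` and `check_downloads` are illustrative and are not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "Solar.TEI.zip": "661f3ab8de4d3c33a5c3ea33a81f8367",
    "Solar.CoNLL-U.zip": "ef4f3b9191395b2d4a186804af465c4a",
    "Solar.vert.zip": "232a84e78aec63772eebd3d4e3cf730f",
    "Smernice-za-oznacevanje-korpusa-Solar_V1.1.pdf": "c8b8b68fd1be51e1edadb7dd249b3ab4",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch), or None (missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in check_downloads().items():
        print(f"{name}: {labels[status]}")
```

Reading in chunks keeps memory use low for the larger archives. A mismatch most likely indicates an incomplete or corrupted download.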
