
 
dc.contributor.author Krek, Simon
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.date.accessioned 2015-05-17T19:14:37Z
dc.date.available 2015-05-17T19:14:37Z
dc.date.issued 2013-09-30
dc.identifier.uri http://hdl.handle.net/11356/1029
dc.description The ssj500k training corpus is based on two training corpora built within the JOS project (http://nl.ijs.si/jos/). It contains the jos100k corpus and additional material from the jos1M corpus, forming a training corpus of 500,000 words that has been manually checked and annotated on the levels of tokenization, segmentation, morphosyntactic tagging, syntactic dependency parsing and named entities. The ssj500k corpus uses the JOS morphosyntactic tagset with 1,902 tags and dependencies with 10 labels. The part of the corpus annotated with dependency relations contains 11,411 sentences; named entities are annotated in the original jos100k part of the corpus.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreplacedby http://hdl.handle.net/11356/1052
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri http://eng.slovenscina.eu/tehnologije/ucni-korpus
dc.subject tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.title Training corpus ssj500k 1.3
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
size.info 500295 words
size.info 586248 tokens
files.count 3
files.size 18564568
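
The description above mentions dependency annotation with 10 labels. The following is a minimal sketch of how one might tally sentences, tokens and dependency labels from a CoNLL-X export. It assumes the standard CoNLL-X layout (ten tab-separated columns, blank line between sentences); this record does not confirm that the distributed file follows that layout exactly. The helper names and the sample rows are hypothetical placeholders, not corpus data.

```python
from collections import Counter

# Standard CoNLL-X column names (assumed to apply to the corpus file).
CONLLX_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conllx(lines):
    """Yield sentences as lists of token dicts; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLLX_COLUMNS, fields)))
    if sentence:  # last sentence when the input lacks a trailing blank line
        yield sentence


def summarize(lines):
    """Return (sentence count, token count, Counter of dependency labels)."""
    n_sentences = 0
    n_tokens = 0
    labels = Counter()
    for sentence in read_conllx(lines):
        n_sentences += 1
        n_tokens += len(sentence)
        labels.update(token.get("deprel", "_") for token in sentence)
    return n_sentences, n_tokens, labels


# Invented placeholder rows, only to show the expected shape of the input.
SAMPLE = (
    "1\tw1\tl1\tX\tX\t_\t2\tlab_a\t_\t_\n"
    "2\tw2\tl2\tX\tX\t_\t0\tlab_b\t_\t_\n"
    "\n"
    "1\tw3\tl3\tX\tX\t_\t0\tlab_a\t_\t_\n"
)

if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
    # For the real data one could pass an open file instead, e.g.
    # open("ssj500kv1_3-sl.conll.tbl", encoding="utf-8"); the UTF-8
    # encoding is an assumption.
```

Reading sentence by sentence with a generator keeps memory use low, which may matter for a file of this size.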


 Files in this item

 Download all files in item (17.7 MB)
Name ssj500kv1_3.zip
Size 7.81 MB
Format application/zip
Description Corpus encoded in TEI-like format with annotations in Slovenian
MD5 9a29d9b0f97f521249c5cf6f0990426d
Contents ssj500kv1_3.xml (84 MB)

Name ssj500kv1_3-en.tei.zip
Size 7.81 MB
Format application/zip
Description Corpus encoded in TEI-like format with annotations in English
MD5 b38204a484d114a2499581c3cce3a3e1
Contents ssj500kv1_3-en.tei.xml (85 MB)

Name ssj500kv1_3-sl.conll.zip
Size 2.08 MB
Format application/zip
Description Corpus encoded in CoNLL-X format
MD5 ae86a11faa2c51a844e47479838c3d25
Contents ssj500kv1_3-sl.conll.tbl (17 MB)
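
The MD5 values listed for each archive can be used to check a download. A minimal sketch using Python's standard hashlib module follows; the function names and the assumption that the archives sit in the current directory are illustrative only.

```python
import hashlib

# MD5 values copied from the file list in this record.
EXPECTED_MD5 = {
    "ssj500kv1_3.zip": "9a29d9b0f97f521249c5cf6f0990426d",
    "ssj500kv1_3-en.tei.zip": "b38204a484d114a2499581c3cce3a3e1",
    "ssj500kv1_3-sl.conll.zip": "ae86a11faa2c51a844e47479838c3d25",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the MD5 listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


if __name__ == "__main__":
    import os

    # Assumes the archives were saved under their listed names.
    for name in EXPECTED_MD5:
        if os.path.exists(name):
            print(name, "OK" if verify(name, name) else "MISMATCH")
        else:
            print(name, "not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.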
