dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.date.accessioned 2016-02-13T13:44:11Z
dc.date.available 2016-02-13T13:44:11Z
dc.date.issued 2015-10-26
dc.identifier.uri http://hdl.handle.net/11356/1052
dc.description The ssj500k training corpus contains 500,000 words, manually annotated at the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities, and, partially, syntactic dependencies. The ssj500k corpus uses the MULTEXT-East / JOS morphosyntactic tagset and the JOS dependency schema and is based on the jos100k and jos1M corpora. Note that this entry updates ssj500k 1.3 by fixing many annotation errors.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.replaces http://hdl.handle.net/11356/1029
dc.relation.isreplacedby http://hdl.handle.net/11356/1165
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri http://eng.slovenscina.eu/tehnologije/ucni-korpus
dc.subject tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.title Training corpus ssj500k 1.4
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
size.info 500295 words
size.info 586248 tokens
size.info 27829 sentences
files.count 3
files.size 18693327
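
The size.info figures above can be cross-checked against the CoNLL-X distribution listed among this item's files. The following is a minimal sketch, not the tooling used to produce the record. It assumes the standard CoNLL-X layout (one token per line, sentences separated by blank lines); the function name `count_conllx` is hypothetical, and how the record's own counts were derived is not stated here, so small differences are possible.

```python
def count_conllx(lines):
    """Count sentences and tokens in an iterable of CoNLL-X lines.

    Assumes one token per non-blank line and blank lines between sentences.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        tokens += 1
        in_sentence = True
    # Count a final sentence that is not followed by a blank line.
    if in_sentence:
        sentences += 1
    return sentences, tokens
```

To use it on the unpacked file (the UTF-8 encoding is an assumption), open the file with `open("ssj500k.conllx.tbl", encoding="utf-8")` and pass the file object to `count_conllx`, then compare the result with the sentence and token counts given in size.info.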


 Files in this item

Name: ssj500kv1_4-sl.zip
Size: 7.87 MB
Format: application/zip
Description: Corpus encoded in TEI format with annotations in Slovenian
MD5: 3f4ee148c5c1da9a5c30ea9e9403bdd3
Contents: ssj500k-sl.xml (88 MB)

Name: ssj500kv1_4-en.zip
Size: 7.87 MB
Format: application/zip
Description: Corpus encoded in TEI format with annotations in English
MD5: ee1dbe568c317eb0a59f889315e9500b
Contents: ssj500k-en.xml (88 MB)

Name: ssj500kv1_4.conllx.zip
Size: 2.09 MB
Format: application/zip
Description: Corpus encoded in CoNLL-X format
MD5: 719ea3228dc79354c34ac54157148486
Contents: ssj500k.conllx.tbl (17 MB)
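
The MD5 values listed for the three archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the archives keep the file names shown in this record.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "ssj500kv1_4-sl.zip": "3f4ee148c5c1da9a5c30ea9e9403bdd3",
    "ssj500kv1_4-en.zip": "ee1dbe568c317eb0a59f889315e9500b",
    "ssj500kv1_4.conllx.zip": "719ea3228dc79354c34ac54157148486",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no listed checksum for {path.name}")
    return md5_of_file(path) == expected
```

For example, `verify_download("ssj500kv1_4.conllx.zip")` returns `True` only when the local file's digest equals the listed value. MD5 is adequate here for detecting accidental corruption, not for guarding against deliberate tampering.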
