dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.date.accessioned | 2016-02-13T13:44:11Z |
dc.date.available | 2016-02-13T13:44:11Z |
dc.date.issued | 2015-10-26 |
dc.identifier.uri | http://hdl.handle.net/11356/1052 |
dc.description | The ssj500k training corpus contains 500,000 words, manually annotated on the levels of tokenization, sentence segmentation, morphosyntactic tagging, lemmatisation, named entities, and, partially, syntactic dependencies. The ssj500k corpus uses the MULTEXT-East / JOS morphosyntactic tagset and the JOS dependency schema and is based on the jos100k and jos1M corpora. Note that this entry updates ssj500k 1.3 by fixing many annotation errors. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.replaces | http://hdl.handle.net/11356/1029 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1165 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://eng.slovenscina.eu/ucni-korpus |
dc.subject | tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.title | Training corpus ssj500k 1.4 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek; simon.krek@guest.arnes.si; Jožef Stefan Institute |
sponsor | Ministry of Education, Science and Sport; 3311-08-986003; Communication in Slovene; Other |
size.info | 500295 words |
size.info | 586248 tokens |
size.info | 27829 sentences |
files.count | 3 |
files.size | 18693327 |
Files in this item
Download all files in this item (17.83 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: ssj500kv1_4-sl.zip
- Size: 7.87 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in Slovenian
- MD5: 3f4ee148c5c1da9a5c30ea9e9403bdd3

- Name: ssj500kv1_4-en.zip
- Size: 7.87 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in English
- MD5: ee1dbe568c317eb0a59f889315e9500b

- Name: ssj500kv1_4.conllx.zip
- Size: 2.09 MB
- Format: application/zip
- Description: Corpus encoded in CoNLL-X format
- MD5: 719ea3228dc79354c34ac54157148486
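
The MD5 values listed for the three archives can be used to check that a download is intact. The following is a minimal sketch, not part of the repository entry: it assumes the archives were saved under their listed names in a single local directory, and the helper names `md5_of_file` and `verify_downloads` are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this entry.
EXPECTED_MD5 = {
    "ssj500kv1_4-sl.zip": "3f4ee148c5c1da9a5c30ea9e9403bdd3",
    "ssj500kv1_4-en.zip": "ee1dbe568c317eb0a59f889315e9500b",
    "ssj500kv1_4.conllx.zip": "719ea3228dc79354c34ac54157148486",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True (match), False (mismatch),
    or None (file not found in the assumed download directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.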