dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Gantar, Polona |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Kavčič, Teja |
dc.contributor.author | Škrjanec, Iza |
dc.contributor.author | Marko, Dafne |
dc.contributor.author | Jezeršek, Lucija |
dc.contributor.author | Zajc, Anja |
dc.date.accessioned | 2019-01-26T20:37:28Z |
dc.date.available | 2019-01-26T20:37:28Z |
dc.date.issued | 2019-01-26 |
dc.identifier.uri | http://hdl.handle.net/11356/1210 |
dc.description | The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is annotated with semantic role labels. The morphosyntactic tags and syntactic dependencies are included both in the JOS/MULTEXT-East framework and in the framework of Universal Dependencies. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, http://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) and (2) is provided in the back element, and that of (3), (4), and (5) in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. In contrast to the previous version 2.1, this version corrects various errata in spacing and text metadata and adds UD morphological and (where it was possible to do so automatically) dependency annotations to the corpus. Note that the UD annotations are not included in the vertical file. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreferencedby | http://dx.doi.org/10.18653/v1/W17-1406 |
dc.relation.replaces | http://hdl.handle.net/11356/1181 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1434 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://eng.slovenscina.eu/ucni-korpus.html |
dc.subject | part-of-speech tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | verbal multiword expressions |
dc.subject | semantic role labelling |
dc.subject | CONLL-U |
dc.title | Training corpus ssj500k 2.2 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | false |
hidden | hidden |
hasMetadata | false |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute |
sponsor | Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
sponsor | ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 586248 tokens |
size.info | 27829 sentences |
size.info | 500295 words |
files.count | 4 |
files.size | 42941421 |
featuredService.kontext | search|https://www.clarin.si/kontext/first_form?corpname=ssj500k22 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=ssj500k22 |
Files in this item
Download all files in this item (40.95 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name
- ssj500k.conllu.zip
- Size
- 10 MB
- Format
- application/zip
- Description
- Corpus in CoNLL-U format: the complete corpus with UD morphology and, separately, the UD syntactically annotated part, also split into train/dev/test.
- MD5
- f65ae2995a2a7acfe43b1a5aa3140dca
- ssj500k.conllu
- ssj500k-ud-morphology.conllu (38 MB)
- sl_ssj-ud_v2.4-dev.conllu (1 MB)
- sl_ssj-ud_v2.4.conllu (11 MB)
- sl_ssj-ud_v2.4-train.conllu (9 MB)
- sl_ssj-ud_v2.4-test.conllu (1 MB)
- 00README.txt (147 B)
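
The CoNLL-U files listed above can be read without special tooling. The following is a minimal sketch, assuming the files follow the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with `#`, a blank line after each sentence). The function name `read_conllu` and the sample sentence are illustrative and are not taken from the corpus.

```python
def read_conllu(lines):
    """Yield each sentence as a list of (form, lemma, upos) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata such as sent_id.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence


# Invented sample for illustration only; not an excerpt from ssj500k.
SAMPLE_LINES = [
    "# sent_id = example.1",
    "# text = To je test.",
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]

sentences = list(read_conllu(SAMPLE_LINES))
print(len(sentences), sum(len(s) for s in sentences))
```

To process one of the extracted files, pass an open file handle instead of the sample, for example `read_conllu(open("sl_ssj-ud_v2.4-train.conllu", encoding="utf-8"))`.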

- Name
- ssj500k-en.TEI.zip
- Size
- 11.92 MB
- Format
- application/zip
- Description
- Corpus encoded in TEI format with annotations in English
- MD5
- 2c5bb4d729bb03dbc2d88d8358196cfa
- ssj500k-en.TEI
- ssj500k.back.xml (552 kB)
- ssj500k-en.xml (51 kB)
- schema
- tei_clarin_doc.xml (7 MB)
- tei_clarin.zip (87 kB)
- tei_clarin.rnc (282 kB)
- tei_clarin_schema.xml (3 kB)
- tei_clarin_example.xml (32 kB)
- tei_clarin.dtd (229 kB)
- tei_clarin_doc.html (7 MB)
- tei_clarin.rng (579 kB)
- 00README.txt (147 B)
- ssj500k-en.body.xml (98 MB)

- Name
- ssj500k-sl.TEI.zip
- Size
- 11.92 MB
- Format
- application/zip
- Description
- Corpus encoded in TEI format with annotations in Slovene
- MD5
- da8d2116b54be5d26ec675e8bb5fc996
- ssj500k-sl.TEI
- ssj500k-sl.xml (51 kB)
- ssj500k-sl.body.xml (98 MB)
- ssj500k.back.xml (552 kB)
- schema
- tei_clarin_doc.xml (7 MB)
- tei_clarin.zip (87 kB)
- tei_clarin.rnc (282 kB)
- tei_clarin_schema.xml (3 kB)
- tei_clarin_example.xml (32 kB)
- tei_clarin.dtd (229 kB)
- tei_clarin_doc.html (7 MB)
- tei_clarin.rng (579 kB)
- 00README.txt (147 B)

- Name
- ssj500k.vert.zip
- Size
- 7.12 MB
- Format
- application/zip
- Description
- Corpus in derived vertical (Sketch Engine / CQP) format
- MD5
- 4c30a74912329a5252f942829f0f4a79
- ssj500k.vert
- ssj500k22.vert (44 MB)
- ssj500k22.regi (4 kB)
- 00README.txt (147 B)
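
The MD5 values listed for the four archives can be used to check a download. The following is a minimal sketch, assuming the archives were saved under their listed names in one directory; the function names `md5_of` and `check_downloads` are illustrative and not part of any repository tooling.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "ssj500k.conllu.zip": "f65ae2995a2a7acfe43b1a5aa3140dca",
    "ssj500k-en.TEI.zip": "2c5bb4d729bb03dbc2d88d8358196cfa",
    "ssj500k-sl.TEI.zip": "da8d2116b54be5d26ec675e8bb5fc996",
    "ssj500k.vert.zip": "4c30a74912329a5252f942829f0f4a79",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

Reading in chunks keeps memory use low for the larger archives. A mismatch usually indicates an incomplete download rather than a problem with the corpus itself.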