dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Gantar, Polona |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Kavčič, Teja |
dc.contributor.author | Škrjanec, Iza |
dc.contributor.author | Marko, Dafne |
dc.contributor.author | Jezeršek, Lucija |
dc.contributor.author | Zajc, Anja |
dc.date.accessioned | 2021-07-07T15:28:52Z |
dc.date.available | 2021-07-07T15:28:52Z |
dc.date.issued | 2021-07-07 |
dc.identifier.uri | http://hdl.handle.net/11356/1434 |
dc.description | The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is also annotated with semantic role labels. The morphosyntactic tags and syntactic dependencies are provided both in the JOS/MULTEXT-East framework and in the framework of Universal Dependencies. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, http://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) and (2) is provided in the back element, and that of (3), (4), and (5) in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. Compared with the previous version 2.2, this version includes the corrected Universal Dependencies relations from UD version 2.8, updates the TEI encoding, and adds UD annotations to the vertical file. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.isreferencedby | http://dx.doi.org/10.18653/v1/W17-1406 |
dc.relation.replaces | http://hdl.handle.net/11356/1210 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1747 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | part-of-speech tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | verbal multiword expressions |
dc.subject | semantic role labelling |
dc.subject | CONLL-U |
dc.title | Training corpus ssj500k 2.3 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | false |
hasMetadata | false |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
sponsor | ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other |
size.info | 586248 tokens |
size.info | 27829 sentences |
size.info | 500295 words |
files.count | 4 |
files.size | 44929526 |
featuredService.kontext | search|https://www.clarin.si/kontext/first_form?corpname=ssj500k23 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=ssj500k23 |
Files in this item
- Name: ssj500k.conllu.zip
- Size: 8.55 MB
- Format: application/zip
- Description: Corpus in CONLL-U format: the complete corpus with UD morphology and, separately, the UD syntactically annotated part split into train/dev/test
- MD5: ebfd53684457bb4651c13b9bebb45423
- Contents: ssj500k.conllu
  - sl_ssj-ud-dev.conllu (1 MB)
  - ssj500k-morpho.conllu (42 MB)
  - sl_ssj-ud-test.conllu (1 MB)
  - 00README.txt (148 B)
  - sl_ssj-ud-train.conllu (9 MB)
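
The CONLL-U files listed above can be read with a few lines of Python. The following is a minimal sketch, assuming the standard CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, a blank line between sentences). The function name `read_conllu` and the sample sentence are hypothetical illustrations; the sample annotations are made up and are not taken from the corpus.

```python
import io

# Column names as defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in an iterable of lines.

    Hypothetical helper: `tokens` is a list of dicts keyed by CONLLU_FIELDS.
    """
    comments, tokens = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    if tokens:
        # Handle a final sentence that is not followed by a blank line.
        yield comments, tokens

# Made-up sample sentence (illustrative only, not from ssj500k).
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = To je test.\n"
    "1\tTo\tta\tDET\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\ttest\ttest\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for comments, tokens in sentences:
    print(comments[0], [t["form"] for t in tokens])
```

To process a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("sl_ssj-ud-dev.conllu", encoding="utf-8")` after unpacking the archive.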

- Name: ssj500k-en.TEI.zip
- Size: 12.42 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in English
- MD5: 735165b029f6f739082c15e74ee7a7da
- Contents: ssj500k-en.TEI
  - ssj500k.back.xml (500 kB)
  - ssj500k-en.xml (50 kB)
  - schema
    - tei_clarin_doc.xml (7 MB)
    - tei_clarin.zip (87 kB)
    - tei_clarin_example.xml (32 kB)
    - tei_clarin.rnc (282 kB)
    - tei_clarin_schema.xml (3 kB)
    - tei_clarin.dtd (229 kB)
    - tei_clarin_doc.html (7 MB)
    - tei_clarin.rng (579 kB)
  - 00README.txt (148 B)
  - ssj500k-en.body.xml (101 MB)

- Name: ssj500k-sl.TEI.zip
- Size: 12.42 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in Slovene
- MD5: 8d15e43fba438b2dcf328e7453efcd0a
- Contents: ssj500k-sl.TEI
  - ssj500k-sl.xml (50 kB)
  - ssj500k-sl.body.xml (101 MB)
  - ssj500k.back.xml (500 kB)
  - schema
    - tei_clarin_doc.xml (7 MB)
    - tei_clarin.zip (87 kB)
    - tei_clarin_example.xml (32 kB)
    - tei_clarin.rnc (282 kB)
    - tei_clarin_schema.xml (3 kB)
    - tei_clarin.dtd (229 kB)
    - tei_clarin_doc.html (7 MB)
    - tei_clarin.rng (579 kB)
  - 00README.txt (148 B)

- Name: ssj500k.vert.zip
- Size: 9.46 MB
- Format: application/zip
- Description: Corpus in derived vertical (Sketch Engine / CQP) format
- MD5: 8e7c003641c19be0c107b6043eb7f81a
- Contents: ssj500k.vert
  - ssj500k23.vert (89 MB)
  - ssj500k23.regi (4 kB)
  - 00README.txt (148 B)
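
The MD5 values listed for the four archives can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `verify` and the download directory are hypothetical, while the archive names and checksums are copied from the listing above.

```python
import hashlib
from pathlib import Path

# Archive names and MD5 checksums as given in the file listing.
EXPECTED_MD5 = {
    "ssj500k.conllu.zip": "ebfd53684457bb4651c13b9bebb45423",
    "ssj500k-en.TEI.zip": "735165b029f6f739082c15e74ee7a7da",
    "ssj500k-sl.TEI.zip": "8d15e43fba438b2dcf328e7453efcd0a",
    "ssj500k.vert.zip": "8e7c003641c19be0c107b6043eb7f81a",
}

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(download_dir):
    """Map each expected archive to 'ok', 'mismatch', or 'missing'.

    Hypothetical helper: `download_dir` is wherever the archives were saved.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(download_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results

if __name__ == "__main__":
    # Assumes the archives were saved to the current directory.
    for name, status in verify(".").items():
        print(f"{name}: {status}")
```

A matching MD5 only shows that the file arrived intact; it is not a guarantee of authenticity.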