dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Gantar, Polona |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Kavčič, Teja |
dc.contributor.author | Škrjanec, Iza |
dc.contributor.author | Marko, Dafne |
dc.contributor.author | Jezeršek, Lucija |
dc.contributor.author | Zajc, Anja |
dc.date.accessioned | 2018-03-16T17:57:20Z |
dc.date.available | 2018-03-16T17:57:20Z |
dc.date.issued | 2018-03-16 |
dc.identifier.uri | http://hdl.handle.net/11356/1181 |
dc.description | The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is annotated with semantic role labels. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V5 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V5/msd/, (2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (4) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) and (2) is provided in the back element, and that of (3) and (4) in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.replaces | http://hdl.handle.net/11356/1165 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1210 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://eng.slovenscina.eu/ucni-korpus.html |
dc.subject | tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | verbal multiword expressions |
dc.subject | semantic role labelling |
dc.title | Training corpus ssj500k 2.1 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
hidden | hidden |
hasMetadata | false |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek; simon.krek@guest.arnes.si; Jožef Stefan Institute |
sponsor | Ministry of Education, Science and Sport; 3311-08-986003; Communication in Slovene; Other |
sponsor | ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); J6-8256; New grammar of contemporary standard Slovene: sources and methods; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); MR-37487; Young Researcher Programme; nationalFunds |
size.info | 586248 tokens |
size.info | 27829 sentences |
size.info | 500293 words |
files.count | 3 |
files.size | 59017894 |
featuredService.kontext | search|https://www.clarin.si/kontext/first_form?corpname=siparl20 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=siparl20 |
Files in this item
Download all files in this item (56.28 MB)
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: ssj500k-en.TEI.zip
- Size: 28.61 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in English
- MD5: 03b7628391fd95b522c1c651bd3998f8
- ssj500k-en.TEI
- ssj500k.back.xml (552 kB)
- ssj500k-en.xml (51 kB)
- TEI-schema
- tei_clarin_schema.xml (2 kB)
- tei_clarin_example.xml (31 kB)
- README.md (442 B)
- .git
- logs
- info
- exclude (240 B)
- config (267 B)
- index (984 B)
- packed-refs (107 B)
- HEAD (23 B)
- refs
- description (73 B)
- hooks
- applypatch-msg.sample (478 B)
- pre-push.sample (1 kB)
- commit-msg.sample (896 B)
- pre-rebase.sample (4 kB)
- post-update.sample (189 B)
- update.sample (3 kB)
- pre-applypatch.sample (424 B)
- pre-commit.sample (1 kB)
- prepare-commit-msg.sample (1 kB)
- objects
- 38
- 608efc3b186c157056078a232d807f55211693 (188 B)
- b7
- 667063cb0be59c107734d4bd5a704c5fd1d168 (5 MB)
- 0e
- ca574a72a6fe5796ca8d05cc68f13285e5b3c1 (34 kB)
- 66
- 6950df30306db2c3425ba7782c0b6befed6c84 (145 B)
- ed
- 4210cc465ae5fa191cd68b1505d993fc4e34fa (53 kB)
- bc
- c633b2610dfc96174f03342af6139651b3f8d7 (283 kB)
- 0a
- 89d351e8161269476c55227375387cf2d228c9 (146 B)
- 9c
- 174b272c7a7753c1cd9a35545ed8be8ab7a897 (42 kB)
- e2
- 073c2b723530eb3af4b8b68da1f4e1bd767277 (47 kB)
- 92
- aee158a726211f880cc23006534fca06c5e9c1 (39 kB)
- 18
- e8b23448de09592cbc743900dab12184f6298b (529 kB)
- 1e
- da8a44193d15f19f48605ce27caa2dfa7efd45 (47 kB)
- 7f
- eb048041406ce4d53c5cab4bfa82b91a13ed4f (201 B)
- c6
- 6ac42b9ed7619c75e78369f66c7c7959f7bcb7 (188 B)
- 95cfbbd18da1067d666a5d67f99a52f53e6b77 (154 B)
- cd
- fd534208e8325832b2ece7c756dcda9e7ec45c (267 B)
- fe
- e279f73c2004bfb11506d6216c012d086283a5 (253 B)
- 1b
- a81a5f40c6fe2a1af23cae557a28aaa1bd8877 (991 B)
- c4
- bef5443f3ab7081761f67c369e1be09e5015c7 (60 kB)
- ca
- e3bb51d68b11ff8a6c12fff749709f4cc65fb6 (467 kB)
- 7a
- 61fc40e22d2541cf5fae5ec3ca52dfedb66ddf (682 kB)
- 22dd7d710ede758ab604ae807453ac31edacee (324 kB)
- 71
- a2cb49466543634a5b37669daf1b72a8d860e3 (139 B)
- f0
- 6443a2999b9980db796473cc855995ea8eb671 (188 B)
- pack
- 59
- 8f616fc0ab853094a995a125855f6145e75fbe (390 kB)
- 910a1af7d9393cea81a2b6c727207f1f1eb210 (263 B)
- d8
- 26e8389cf41c2183a327e3c3269cdc286ad321 (4 MB)
- info
- 57
- 1d5ec3815ce49175b946cbbc9fc225f8a5a721 (187 B)
- 2e
- f523c4a8551771a6c59dad3dd8c7cae8b4e28e (140 B)
- 8f
- f7bf441576af2cdf172650ed6c3a6b7872b974 (67 kB)
- 24
- f82216fafe1ae5839266d22a2a434d1b6a715c (137 B)
- 23
- 69b9959f8ebebe382896705d316b41e80d9f03 (1 kB)
- 8d
- 0b03188dfafa5423a4403b25ce9091bcd8f5c4 (7 kB)
- 53
- c37344f0f4133c95c6068c9477035ead4034b4 (178 B)
- d2
- f9d23254e6ebbbda855b7c2797a6570c159e6b (353 B)
- 38
- branches
- schema
- tei_clarin.zip (47 kB)
- tei_clarin.rnc (206 kB)
- tei_clarin.dtd (167 kB)
- tei_clarin.rng (424 kB)
- doc
- tei_clarin_doc.xml (3 MB)
- tei_clarin_doc.html (2 MB)
- tei_clarin_doc.docx (698 kB)
- tei_clarin_doc.pdf (5 MB)
- 00README.txt (147 B)
- ssj500k-en.body.xml (83 MB)

- Name: ssj500k-sl.TEI.zip
- Size: 20.59 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in Slovene
- MD5: 3484473205bb72906ed6ce95fadf9a92
- ssj500k-sl.TEI
- ssj500k-sl.xml (46 kB)
- ssj500k-sl.body.xml (83 MB)
- ssj500k.back.xml (552 kB)
- TEI-schema
- tei_clarin_schema.xml (2 kB)
- tei_clarin_example.xml (31 kB)
- README.md (423 B)
- .git
- logs
- info
- exclude (240 B)
- config (267 B)
- index (984 B)
- packed-refs (107 B)
- HEAD (23 B)
- refs
- description (73 B)
- hooks
- applypatch-msg.sample (478 B)
- pre-push.sample (1 kB)
- commit-msg.sample (896 B)
- pre-rebase.sample (4 kB)
- post-update.sample (189 B)
- update.sample (3 kB)
- pre-applypatch.sample (424 B)
- pre-commit.sample (1 kB)
- prepare-commit-msg.sample (1 kB)
- objects
- 38
- 608efc3b186c157056078a232d807f55211693 (188 B)
- 0e
- ca574a72a6fe5796ca8d05cc68f13285e5b3c1 (34 kB)
- 66
- 6950df30306db2c3425ba7782c0b6befed6c84 (145 B)
- bc
- c633b2610dfc96174f03342af6139651b3f8d7 (283 kB)
- 9c
- 174b272c7a7753c1cd9a35545ed8be8ab7a897 (42 kB)
- e2
- 073c2b723530eb3af4b8b68da1f4e1bd767277 (47 kB)
- 18
- e8b23448de09592cbc743900dab12184f6298b (529 kB)
- c6
- 6ac42b9ed7619c75e78369f66c7c7959f7bcb7 (188 B)
- 95cfbbd18da1067d666a5d67f99a52f53e6b77 (154 B)
- cd
- fd534208e8325832b2ece7c756dcda9e7ec45c (267 B)
- fe
- e279f73c2004bfb11506d6216c012d086283a5 (253 B)
- 1b
- a81a5f40c6fe2a1af23cae557a28aaa1bd8877 (991 B)
- c4
- bef5443f3ab7081761f67c369e1be09e5015c7 (60 kB)
- f0
- 6443a2999b9980db796473cc855995ea8eb671 (188 B)
- pack
- 59
- 8f616fc0ab853094a995a125855f6145e75fbe (390 kB)
- d8
- 26e8389cf41c2183a327e3c3269cdc286ad321 (4 MB)
- info
- 2e
- f523c4a8551771a6c59dad3dd8c7cae8b4e28e (140 B)
- 24
- f82216fafe1ae5839266d22a2a434d1b6a715c (137 B)
- 8d
- 0b03188dfafa5423a4403b25ce9091bcd8f5c4 (7 kB)
- 53
- c37344f0f4133c95c6068c9477035ead4034b4 (178 B)
- d2
- f9d23254e6ebbbda855b7c2797a6570c159e6b (353 B)
- 38
- branches
- schema
- tei_clarin.zip (42 kB)
- tei_clarin.rnc (184 kB)
- tei_clarin.dtd (146 kB)
- tei_clarin.rng (377 kB)
- doc
- tei_clarin_doc.xml (3 MB)
- tei_clarin_doc.html (1 MB)
- tei_clarin_doc.docx (546 kB)
- tei_clarin_doc.pdf (4 MB)
- 00README.txt (147 B)

- Name: ssj500k.vert.zip
- Size: 7.08 MB
- Format: application/zip
- Description: Corpus encoded in Sketch Engine (vertical) format
- MD5: d58b5bf8101c5c8c2435597c8e2091d1
- ssj500k.vert
- ssj500k21.vert (44 MB)
- ssj500k21.regi (4 kB)
- 00README.txt (147 B)
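
Each archive in this item is listed with an MD5 checksum, so a download can be checked against the record before use. The following is a minimal sketch of such a check in Python, using only the standard library. The checksums are copied from the listing above; the helper names `md5_of_file` and `verify_archive` are illustrative, and the script assumes the archives were saved under their original file names in the current working directory.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed in this record for the three archives.
EXPECTED_MD5 = {
    "ssj500k-en.TEI.zip": "03b7628391fd95b522c1c651bd3998f8",
    "ssj500k-sl.TEI.zip": "3484473205bb72906ed6ce95fadf9a92",
    "ssj500k.vert.zip": "d58b5bf8101c5c8c2435597c8e2091d1",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with the checksum listed for its file name.

    Raises KeyError if the file name is not one of the listed archives.
    """
    name = Path(path).name
    return md5_of_file(path) == expected[name]


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name in EXPECTED_MD5:
        if Path(name).exists():
            status = "OK" if verify_archive(name) else "MISMATCH"
            print(f"{name}: {status}")
        else:
            print(f"{name}: not found, skipped")
```

A mismatch usually points to an incomplete or corrupted download, in which case the archive should be fetched again. MD5 is used here only because it is the checksum the record provides; it detects accidental corruption but is not a safeguard against deliberate tampering.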