dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.date.accessioned | 2019-02-13T17:13:39Z |
dc.date.available | 2019-02-13T17:13:39Z |
dc.date.issued | 2019-02-13 |
dc.identifier.uri | http://hdl.handle.net/11356/1213 |
dc.description | The jos1M corpus contains 1 million words of paragraphs sampled from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (MSDs) and lemmas, with about one fourth of the more problematic annotations hand-validated. The morphosyntactic descriptions are given both in the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and in the framework of Universal Dependencies for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html). The corpus is available in source TEI XML, with the MSDs in English or Slovene, and in two derived formats: the vertical format used by the CQP and (no)Sketch Engine concordancers, and CONLL-U, used by Universal Dependencies. Note that the corpus does not contain syntactic dependencies. The texts and paragraphs of the jos1M corpus overlap with those of the ssj500k annotated corpus (http://hdl.handle.net/11356/1210), but the latter has been fully manually annotated, and its tokenisation and sentence segmentation have been corrected. The texts and paragraphs in the jos1M corpus are marked if they are also included in ssj500k, and the CONLL-U is split into the part that is included in ssj500k and the part that is not. The part not included in ssj500k can serve, in addition to ssj500k, as a further training set for morphosyntactic tagging and lemmatisation. |
dc.language.iso | slv |
dc.publisher | Jožef Stefan Institute |
dc.relation.isreferencedby | http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html |
dc.relation.replaces | http://hdl.handle.net/11356/1037 |
dc.rights | Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://nl.ijs.si/jos/jos1M-en.html |
dc.subject | part-of-speech tagging |
dc.subject | lemmatisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | CONLL-U |
dc.title | Training corpus jos1M 1.2 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor | ARRS (Slovenian Research Agency) J2-9180 Linguistic annotation of Slovene nationalFunds |
sponsor | EU FP6 033917 SMART “Statistical Multilingual Analysis for Retrieval and Translation” Other |
sponsor | Ministry of Higher Education, Science and Technology European Fund for Regional Development Mobile reader for blind and sight impaired persons nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
size.info | 999961 words |
size.info | 1182814 tokens |
size.info | 2564 texts |
files.count | 4 |
files.size | 113875922 |
featuredService.kontext | search|https://www.clarin.si/kontext/first_form?corpname=jos1m |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=jos1m |
Files in this item
- Name: jos1M-en.TEI.zip
- Size: 35.67 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in English
- MD5: 0fc0c3b4df3bf5746f4830327ede8a03
- jos1M-en.TEI
- jos1M.back.xml (461 kB)
- jos1M-en.body.xml (162 MB)
- jos1M-en.xml (24 kB)
- TEI-schema
- tei_clarin_schema.xml (3 kB)
- tei_clarin_example.xml (32 kB)
- README.md (442 B)
- .git
- schema
- tei_clarin.zip (87 kB)
- tei_clarin.rnc (282 kB)
- tei_clarin.dtd (229 kB)
- tei_clarin.rng (579 kB)
- doc
- tei_clarin_doc.xml (7 MB)
- tei_clarin_doc.html (7 MB)
- 00README.txt (202 B)
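
Each archive in this record is listed with an MD5 checksum, so a download can be checked before unpacking. The following is a minimal sketch of such a check using Python's standard `hashlib`; the function names `md5_of_file` and `verify_archive` are illustrative and not part of any CLARIN.SI tooling, and the checksum table simply copies the values given in this record.

```python
import hashlib
from pathlib import Path

# MD5 sums copied from the file entries of this record.
EXPECTED_MD5 = {
    "jos1M-en.TEI.zip": "0fc0c3b4df3bf5746f4830327ede8a03",
    "jos1M-sl.TEI.zip": "3f22ad68fad38a7d05908b5503a9c05d",
    "jos1M.vert.zip": "361c447d4a062be21ffed3fa90f902a2",
    "jos1M.conllu.zip": "2264757cbc5b2de5b32106c06fb5c14d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Compare a downloaded archive against the checksum listed in this record.

    Raises KeyError if the file name is not one of the four listed archives.
    """
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected


# Hypothetical usage, assuming the archive was saved to the current directory:
# if not verify_archive("jos1M-en.TEI.zip"):
#     raise SystemExit("checksum mismatch, download again")
```

MD5 is used here only because it is the checksum the record publishes; it detects corrupted downloads but is not a safeguard against deliberate tampering.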

- Name: jos1M-sl.TEI.zip
- Size: 35.67 MB
- Format: application/zip
- Description: Corpus encoded in TEI format with annotations in Slovene
- MD5: 3f22ad68fad38a7d05908b5503a9c05d
- jos1M-sl.TEI
- jos1M.back.xml (461 kB)
- jos1M-sl.xml (24 kB)
- jos1M-sl.body.xml (162 MB)
- TEI-schema
- tei_clarin_schema.xml (3 kB)
- tei_clarin_example.xml (32 kB)
- README.md (442 B)
- .git
- schema
- tei_clarin.zip (87 kB)
- tei_clarin.rnc (282 kB)
- tei_clarin.dtd (229 kB)
- tei_clarin.rng (579 kB)
- doc
- tei_clarin_doc.xml (7 MB)
- tei_clarin_doc.html (7 MB)
- 00README.txt (202 B)

- Name: jos1M.vert.zip
- Size: 13.18 MB
- Format: application/zip
- Description: Corpus in derived vertical (Sketch Engine / CQP) format
- MD5: 361c447d4a062be21ffed3fa90f902a2
- jos1M.vert
- jos1M.vert (93 MB)
- jos1m.regi (3 kB)
- 00README.txt (202 B)
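
For readers who want to process the vertical file outside a concordancer, the sketch below reads the general vertical layout used by CQP and (no)Sketch Engine: one token per line with tab-separated attributes, and structures given as XML-like tag lines. It rests on assumptions that are not confirmed by this record: that sentences are delimited by `<s>` ... `</s>`, and that tag lines contain no tab characters. The actual attribute names and order are declared in `jos1m.regi`, so the function returns raw column lists rather than named fields. The name `read_vertical` is illustrative.

```python
import re

# A structural line such as <s>, </s>, <g/> or <text id="...">.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)([^>]*?)(/?)>$")


def read_vertical(lines, sentence_tag="s"):
    """Yield sentences as lists of token rows (each row a list of columns).

    Tokens outside the assumed sentence structure are skipped, as are all
    structural tags other than the sentence boundaries.
    """
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        match = TAG_RE.match(line) if "\t" not in line else None
        if match:
            closing, name, _attrs, self_closing = match.groups()
            if name == sentence_tag and not self_closing:
                if closing:
                    if current is not None:
                        yield current
                    current = None
                else:
                    current = []
            continue
        if current is not None:
            current.append(line.split("\t"))


# Hypothetical usage, assuming the archive was unpacked into the current directory:
# with open("jos1M.vert/jos1M.vert", encoding="utf-8") as handle:
#     for sentence in read_vertical(handle):
#         print(len(sentence))
```

The column meanings should be taken from `jos1m.regi` before any attribute is interpreted as a lemma or MSD.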

- Name: jos1M.conllu.zip
- Size: 24.07 MB
- Format: application/zip
- Description: Corpus in derived CONLL-U format
- MD5: 2264757cbc5b2de5b32106c06fb5c14d
- jos1M.conllu
- jos1M_ssj500k_no-ud-morphology.conllu (47 MB)
- jos1M_ssj500k_yes-ud-morphology.conllu (30 MB)
- jos1M-ud-morphology.conllu (77 MB)
- 00README.txt (202 B)
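
Since the CONLL-U files are intended as training data for tagging and lemmatisation, a reader for the word-level columns may be a useful starting point. The sketch below follows the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and keeps only the word-level fields, because the description states that the corpus carries no syntactic dependencies. That the XPOS column holds the JOS/MULTEXT-East MSD is an assumption, not something this record confirms; the names `read_conllu` and `parse_feats` are illustrative.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict ('_' means none)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(lines):
    """Yield sentences as lists of token dicts with word-level annotation only."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (ids, text) are ignored in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "xpos": cols[4],  # assumed to be the JOS/MULTEXT-East MSD
            "feats": parse_feats(cols[5]),
        })
    if sentence:
        yield sentence


# Hypothetical usage, assuming the archive was unpacked into the current directory:
# path = "jos1M.conllu/jos1M_ssj500k_no-ud-morphology.conllu"
# with open(path, encoding="utf-8") as handle:
#     for sentence in read_conllu(handle):
#         print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sentence])
```

Following the description, the file marked `ssj500k_no` would be the part not contained in ssj500k, and therefore the natural choice when the data is added to an ssj500k training set.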