
 
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.date.accessioned 2019-02-13T17:13:39Z
dc.date.available 2019-02-13T17:13:39Z
dc.date.issued 2019-02-13
dc.identifier.uri http://hdl.handle.net/11356/1213
dc.description The jos1M corpus contains 1 million words of sampled paragraphs from the Gigafida corpus. It is meant to serve as a training corpus for word-level tagging of Slovene. This silver-standard corpus is annotated with morphosyntactic descriptions (MSDs) and lemmas, with about one fourth of the more problematic annotations hand-validated. The morphosyntactic descriptions are given both in the JOS/MULTEXT-East framework (http://nl.ijs.si/ME/V6/msd/) and in the framework of Universal Dependencies for Slovene (https://universaldependencies.org/treebanks/sl_ssj/index.html). The corpus is available in source TEI XML, with the MSDs in English or Slovene; in the derived vertical format, used by the CQP and (no)Sketch Engine concordancers; and in CONLL-U, used by Universal Dependencies. Note that the corpus does not contain syntactic dependencies. The texts or paragraphs of the jos1M corpus overlap with those of the ssj500k annotated corpus (http://hdl.handle.net/11356/1210), but the latter has been fully manually annotated and has had its tokenisation and sentence segmentation corrected. The texts and paragraphs in the jos1M corpus are marked if they are also included in ssj500k, and the CONLL-U is split into the part that is included in ssj500k and the part that is not. The latter can serve as a training set for morphosyntactic tagging and lemmatisation in addition to ssj500k.
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2010/summaries/139.html
dc.relation.replaces http://hdl.handle.net/11356/1037
dc.rights Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc/4.0/
dc.rights.label PUB
dc.source.uri http://nl.ijs.si/jos/jos1M-en.html
dc.subject part-of-speech tagging
dc.subject lemmatisation
dc.subject manual annotation
dc.subject TEI
dc.subject CONLL-U
dc.title Training corpus jos1M 1.2
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J2-9180 Linguistic annotation of Slovene nationalFunds
sponsor EU FP6 033917 SMART “Statistical Multilingual Analysis for Retrieval and Translation” Other
sponsor Ministry of Higher Education, Science and Technology European Fund for Regional Development Mobile reader for blind and sight impaired persons nationalFunds
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 999961 words
size.info 1182814 tokens
size.info 2564 texts
files.count 4
files.size 113875922
featuredService.kontext search|https://www.clarin.si/kontext/first_form?corpname=jos1m
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=jos1m&struct_attr_stats=1&subcorpora=1
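
The description states that the CONLL-U files carry lemmas and morphosyntactic tags but no syntactic dependencies. The following is a minimal sketch of reading such a file, assuming only the standard ten tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The names `Token`, `read_conllu`, `upos_counts`, and `SAMPLE` are illustrative, and the sample sentences are made up for the example, not taken from the corpus; which tagset the real files place in the XPOS column is not checked here.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # comment / sentence metadata
            continue
        cols = line.split("\t")
        # Keep plain tokens only; skip ranges (1-2), empty nodes (1.1),
        # and lines that do not have exactly ten columns.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append(Token(cols[1], cols[2], cols[3], cols[4], cols[5]))
    if sentence:                          # file may lack a trailing blank line
        yield sentence


def upos_counts(sentences: Iterable[List[Token]]) -> Counter:
    """Count universal part-of-speech tags over all tokens."""
    return Counter(tok.upos for sent in sentences for tok in sent)


# Made-up sample (hypothetical annotations); HEAD and DEPREL are "_"
# because no dependencies are assumed to be present.
SAMPLE = (
    "# sent_id = example.1\n"
    "1\tTo\tta\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\ttest\ttest\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = example.2\n"
    "1\tDela\tdelati\tVERB\t_\t_\t_\t_\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
counts = upos_counts(sentences)
```

To run the same functions over the real data, pass an open text file handle (UTF-8) for one of the unpacked `.conllu` files to `read_conllu` instead of `SAMPLE.splitlines()`; the generator reads line by line, so the whole file does not need to fit in memory.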


 Files in this item

 Download all files in item (108.6 MB)
This item is publicly available and licensed under Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
Name jos1M-en.TEI.zip
Size 35.67 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in English
MD5 0fc0c3b4df3bf5746f4830327ede8a03
File preview
  • jos1M-en.TEI
    • jos1M.back.xml (461 kB)
    • jos1M-en.body.xml (162 MB)
    • jos1M-en.xml (24 kB)
    • TEI-schema
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin_example.xml (32 kB)
      • README.md (442 B)
      • .git
      • schema
        • tei_clarin.zip (87 kB)
        • tei_clarin.rnc (282 kB)
        • tei_clarin.dtd (229 kB)
        • tei_clarin.rng (579 kB)
      • doc
        • tei_clarin_doc.xml (7 MB)
        • tei_clarin_doc.html (7 MB)
    • 00README.txt (202 B)
Name jos1M-sl.TEI.zip
Size 35.67 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in Slovene
MD5 3f22ad68fad38a7d05908b5503a9c05d
File preview
  • jos1M-sl.TEI
    • jos1M.back.xml (461 kB)
    • jos1M-sl.xml (24 kB)
    • jos1M-sl.body.xml (162 MB)
    • TEI-schema
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin_example.xml (32 kB)
      • README.md (442 B)
      • .git
      • schema
        • tei_clarin.zip (87 kB)
        • tei_clarin.rnc (282 kB)
        • tei_clarin.dtd (229 kB)
        • tei_clarin.rng (579 kB)
      • doc
        • tei_clarin_doc.xml (7 MB)
        • tei_clarin_doc.html (7 MB)
    • 00README.txt (202 B)
Name jos1M.vert.zip
Size 13.18 MB
Format application/zip
Description Corpus in derived vertical (Sketch Engine / CQP) format
MD5 361c447d4a062be21ffed3fa90f902a2
File preview
  • jos1M.vert
    • jos1M.vert (93 MB)
    • jos1m.regi (3 kB)
    • 00README.txt (202 B)
Name jos1M.conllu.zip
Size 24.07 MB
Format application/zip
Description Corpus in derived CONLL-U format
MD5 2264757cbc5b2de5b32106c06fb5c14d
File preview
  • jos1M.conllu
    • jos1M_ssj500k_no-ud-morphology.conllu (47 MB)
    • jos1M_ssj500k_yes-ud-morphology.conllu (30 MB)
    • jos1M-ud-morphology.conllu (77 MB)
    • 00README.txt (202 B)
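
Each archive in the listing above is published with an MD5 checksum. As a rough sketch of how a download could be checked against those values, the snippet below hashes each file in chunks and compares the result. The file names and checksums are copied from this record; the function names `md5_of` and `verify` and the idea that all four archives sit in one local directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 values as given in the file listing of this record.
EXPECTED_MD5 = {
    "jos1M-en.TEI.zip": "0fc0c3b4df3bf5746f4830327ede8a03",
    "jos1M-sl.TEI.zip": "3f22ad68fad38a7d05908b5503a9c05d",
    "jos1M.vert.zip": "361c447d4a062be21ffed3fa90f902a2",
    "jos1M.conllu.zip": "2264757cbc5b2de5b32106c06fb5c14d",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify("downloads")` (a hypothetical directory name) returns a dictionary such as `{"jos1M.vert.zip": "ok", ...}`. MD5 is adequate here for detecting corrupted or truncated downloads, but it is not a safeguard against deliberate tampering.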
