dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.contributor.author Zupan, Katja
dc.contributor.author Gantar, Polona
dc.contributor.author Kuzman, Taja
dc.date.accessioned 2017-11-23T21:36:56Z
dc.date.available 2017-11-23T21:36:56Z
dc.date.issued 2017-11-23
dc.identifier.uri http://hdl.handle.net/11356/1165
dc.description The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V5 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V5/msd/, (2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (4) the guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/. The vocabularies of (1) and (2) are provided in the back element, and those of (3) and (4) in the teiHeader, of the TEI-encoded corpus.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.replaces http://hdl.handle.net/11356/1052
dc.relation.isreplacedby http://hdl.handle.net/11356/1181
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri http://eng.slovenscina.eu/tehnologije/ucni-korpus
dc.subject tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.subject verbal multiword expressions
dc.title Training corpus ssj500k 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden hidden
hasMetadata false
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
size.info 586248 tokens
size.info 27829 sentences
size.info 500293 words
files.count 3
files.size 52041634
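
The size.info figures above could be spot-checked against the TEI files with a short script. What follows is a minimal Python sketch, not a documented procedure: it assumes that sentences, words and punctuation are encoded as the standard TEI elements s, w and pc, and that a token is either a w or a pc. Neither assumption is stated in this record, and the function name count_units and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def count_units(source):
    """Count <s>, <w> and <pc> elements in a TEI file (path or file object).

    Assumption: these element names are the ones used by the corpus;
    namespaces are ignored by comparing local names only.
    """
    counts = {"s": 0, "w": 0, "pc": 0}
    # iterparse streams the document, which matters for an 80 MB body file.
    for _event, elem in ET.iterparse(source, events=("end",)):
        local = elem.tag.rsplit("}", 1)[-1]
        if local in counts:
            counts[local] += 1
        if local == "s":
            # Children of this sentence were already counted; free them.
            elem.clear()
    return counts


# Hypothetical miniature document, used only to exercise the function.
SAMPLE = (
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><p>'
    b'<s><w lemma="to">To</w> <w lemma="biti">je</w> '
    b'<w lemma="test">test</w><pc>.</pc></s>'
    b"</p></body></text></TEI>"
)

counts = count_units(io.BytesIO(SAMPLE))
tokens = counts["w"] + counts["pc"]  # assumed definition of "token"
print(counts, tokens)
```

On the real corpus, count_units would be called with the path to the extracted body file; whether the totals match the figures listed here depends on the assumptions above.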


 Files in this item

 Download all files in item (49.63 MB)
Name ssj500k-en.TEI.zip
Size 21.28 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in English
MD5 dc196518015e1943e21883ccaae82fd9
File preview
  • ssj500k-en.TEI
    • ssj500k.back.xml (552 kB)
    • ssj500k-en.xml (39 kB)
    • TEI-schema
      • tei_clarin_schema.xml (2 kB)
      • tei_clarin_example.xml (31 kB)
      • README.md (423 B)
      • .git
        • logs
        • info
          • exclude (240 B)
        • config (267 B)
        • index (984 B)
        • packed-refs (107 B)
        • HEAD (23 B)
        • refs
        • description (73 B)
        • hooks
          • applypatch-msg.sample (478 B)
          • pre-push.sample (1 kB)
          • commit-msg.sample (896 B)
          • pre-rebase.sample (4 kB)
          • post-update.sample (189 B)
          • update.sample (3 kB)
          • pre-applypatch.sample (424 B)
          • pre-commit.sample (1 kB)
          • prepare-commit-msg.sample (1 kB)
        • objects
          • 38
            • 608efc3b186c157056078a232d807f55211693 (188 B)
          • 0e
            • ca574a72a6fe5796ca8d05cc68f13285e5b3c1 (34 kB)
          • 66
            • 6950df30306db2c3425ba7782c0b6befed6c84 (145 B)
          • bc
            • c633b2610dfc96174f03342af6139651b3f8d7 (283 kB)
          • 9c
            • 174b272c7a7753c1cd9a35545ed8be8ab7a897 (42 kB)
          • e2
            • 073c2b723530eb3af4b8b68da1f4e1bd767277 (47 kB)
          • 18
            • e8b23448de09592cbc743900dab12184f6298b (529 kB)
          • c6
            • 6ac42b9ed7619c75e78369f66c7c7959f7bcb7 (188 B)
            • 95cfbbd18da1067d666a5d67f99a52f53e6b77 (154 B)
          • cd
            • fd534208e8325832b2ece7c756dcda9e7ec45c (267 B)
          • fe
            • e279f73c2004bfb11506d6216c012d086283a5 (253 B)
          • 1b
            • a81a5f40c6fe2a1af23cae557a28aaa1bd8877 (991 B)
          • c4
            • bef5443f3ab7081761f67c369e1be09e5015c7 (60 kB)
          • f0
            • 6443a2999b9980db796473cc855995ea8eb671 (188 B)
          • pack
            • 59
              • 8f616fc0ab853094a995a125855f6145e75fbe (390 kB)
            • d8
              • 26e8389cf41c2183a327e3c3269cdc286ad321 (4 MB)
            • info
              • 2e
                • f523c4a8551771a6c59dad3dd8c7cae8b4e28e (140 B)
              • 24
                • f82216fafe1ae5839266d22a2a434d1b6a715c (137 B)
              • 8d
                • 0b03188dfafa5423a4403b25ce9091bcd8f5c4 (7 kB)
              • 53
                • c37344f0f4133c95c6068c9477035ead4034b4 (178 B)
              • d2
                • f9d23254e6ebbbda855b7c2797a6570c159e6b (353 B)
            • branches
            • schema
              • tei_clarin.zip (42 kB)
              • tei_clarin.rnc (184 kB)
              • tei_clarin.dtd (146 kB)
              • tei_clarin.rng (377 kB)
            • doc
              • tei_clarin_doc.xml (3 MB)
              • tei_clarin_doc.html (1 MB)
              • tei_clarin_doc.docx (546 kB)
              • tei_clarin_doc.pdf (4 MB)
          • schema
            • tei_clarin.zip (40 kB)
            • tei_clarin_schema.xml (1 kB)
            • tei_clarin.rnc (172 kB)
            • tei_clarin_doc.html (1 MB)
            • tei_clarin.rng (354 kB)
            • tei_clarin_doc.pdf (1 MB)
          • 00README.txt (147 B)
          • ssj500k-en.body.xml (80 MB)

Name ssj500k-sl.TEI.zip
Size 21.28 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in Slovene
MD5 0998bae603ff2dd315954a080001a3a5
File preview
  • ssj500k-sl.TEI
    • ssj500k-sl.xml (39 kB)
    • ssj500k-sl.body.xml (80 MB)
    • ssj500k.back.xml (552 kB)
    • TEI-schema
      • tei_clarin_schema.xml (2 kB)
      • tei_clarin_example.xml (31 kB)
      • README.md (423 B)
      • .git
        • logs
        • info
          • exclude (240 B)
        • config (267 B)
        • index (984 B)
        • packed-refs (107 B)
        • HEAD (23 B)
        • refs
        • description (73 B)
        • hooks
          • applypatch-msg.sample (478 B)
          • pre-push.sample (1 kB)
          • commit-msg.sample (896 B)
          • pre-rebase.sample (4 kB)
          • post-update.sample (189 B)
          • update.sample (3 kB)
          • pre-applypatch.sample (424 B)
          • pre-commit.sample (1 kB)
          • prepare-commit-msg.sample (1 kB)
        • objects
          • 38
            • 608efc3b186c157056078a232d807f55211693 (188 B)
          • 0e
            • ca574a72a6fe5796ca8d05cc68f13285e5b3c1 (34 kB)
          • 66
            • 6950df30306db2c3425ba7782c0b6befed6c84 (145 B)
          • bc
            • c633b2610dfc96174f03342af6139651b3f8d7 (283 kB)
          • 9c
            • 174b272c7a7753c1cd9a35545ed8be8ab7a897 (42 kB)
          • e2
            • 073c2b723530eb3af4b8b68da1f4e1bd767277 (47 kB)
          • 18
            • e8b23448de09592cbc743900dab12184f6298b (529 kB)
          • c6
            • 6ac42b9ed7619c75e78369f66c7c7959f7bcb7 (188 B)
            • 95cfbbd18da1067d666a5d67f99a52f53e6b77 (154 B)
          • cd
            • fd534208e8325832b2ece7c756dcda9e7ec45c (267 B)
          • fe
            • e279f73c2004bfb11506d6216c012d086283a5 (253 B)
          • 1b
            • a81a5f40c6fe2a1af23cae557a28aaa1bd8877 (991 B)
          • c4
            • bef5443f3ab7081761f67c369e1be09e5015c7 (60 kB)
          • f0
            • 6443a2999b9980db796473cc855995ea8eb671 (188 B)
          • pack
            • 59
              • 8f616fc0ab853094a995a125855f6145e75fbe (390 kB)
            • d8
              • 26e8389cf41c2183a327e3c3269cdc286ad321 (4 MB)
            • info
              • 2e
                • f523c4a8551771a6c59dad3dd8c7cae8b4e28e (140 B)
              • 24
                • f82216fafe1ae5839266d22a2a434d1b6a715c (137 B)
              • 8d
                • 0b03188dfafa5423a4403b25ce9091bcd8f5c4 (7 kB)
              • 53
                • c37344f0f4133c95c6068c9477035ead4034b4 (178 B)
              • d2
                • f9d23254e6ebbbda855b7c2797a6570c159e6b (353 B)
            • branches
            • schema
              • tei_clarin.zip (42 kB)
              • tei_clarin.rnc (184 kB)
              • tei_clarin.dtd (146 kB)
              • tei_clarin.rng (377 kB)
            • doc
              • tei_clarin_doc.xml (3 MB)
              • tei_clarin_doc.html (1 MB)
              • tei_clarin_doc.docx (546 kB)
              • tei_clarin_doc.pdf (4 MB)
          • schema
            • tei_clarin.zip (40 kB)
            • tei_clarin_schema.xml (1 kB)
            • tei_clarin.rnc (172 kB)
            • tei_clarin_doc.html (1 MB)
            • tei_clarin.rng (354 kB)
            • tei_clarin_doc.pdf (1 MB)
          • 00README.txt (147 B)

Name ssj500k.vert.zip
Size 7.07 MB
Format application/zip
Description Corpus encoded in Sketch Engine (vertical) format
MD5 2dc4a7d319631379049599b76b94f9ef
File preview
  • ssj500k.vert
    • ssj500k20.vert (44 MB)
    • 00README.txt (147 B)
    • ssj500k20.regi (3 kB)
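
The MD5 values listed for the three archives can be used to check a download for corruption. The following is a minimal Python sketch using the standard hashlib module; the file names and checksums are copied from this record, while the helper names md5_of_file and verify, and the assumption that the archives sit under a local path of your choosing, are illustrative only.

```python
import hashlib

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "ssj500k-en.TEI.zip": "dc196518015e1943e21883ccaae82fd9",
    "ssj500k-sl.TEI.zip": "0998bae603ff2dd315954a080001a3a5",
    "ssj500k.vert.zip": "2dc4a7d319631379049599b76b94f9ef",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare the digest of the file at `path` with the listed MD5 for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify("ssj500k.vert.zip", "ssj500k.vert.zip") would return True for an intact copy saved in the current directory. MD5 is suitable here only as an integrity check, not as protection against deliberate tampering.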
