dc.contributor.author Ljubešić, Nikola
dc.contributor.author Agić, Željko
dc.contributor.author Klubička, Filip
dc.contributor.author Batanović, Vuk
dc.contributor.author Erjavec, Tomaž
dc.date.accessioned 2018-04-16T11:06:31Z
dc.date.available 2018-04-16T11:06:31Z
dc.date.issued 2018-04-13
dc.identifier.uri http://hdl.handle.net/11356/1183
dc.description The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotations (and other aspects) of the hr500k corpus are documented in the teiHeader and back element of the TEI encoded corpus. In short, they follow (1) the MULTEXT-East V5 morphosyntactic specifications for Croatian, http://nl.ijs.si/ME/V5/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, while (4) the semantic role labelling annotation guidelines are currently in the publication process.
dc.language.iso hrv
dc.publisher Jožef Stefan Institute
dc.relation.isreferencedby http://www.lrec-conf.org/proceedings/lrec2016/summaries/340.html
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri https://github.com/nljubesi/hr500k
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.subject semantic role labelling
dc.title Training corpus hr500k 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor Swiss National Science Foundation 160501 ReLDI Other
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 900 texts
size.info 24624 sentences
size.info 506457 tokens
files.count 3
files.size 95979137
featuredService.kontext search|https://www.clarin.si/kontext/first_form?corpname=hr500k
featuredService.noske search|https://www.clarin.si/noske/run.cgi/corp_info?corpname=hr500k
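
The size fields above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the variable names are made up, all numbers are copied from this record, and it assumes that the "MB" figures shown for the files are binary megabytes (bytes divided by 2^20), which is what the numbers suggest.

```python
# Values copied from the size.info and files.size fields of this record.
TEXTS = 900
SENTENCES = 24624
TOKENS = 506457
TOTAL_BYTES = 95979137

# Archive sizes in MB as listed in the file entries of this record.
ARCHIVE_MB = {
    "hr500k.TEI.zip": 28.64,
    "hr500k.vert.zip": 7.02,
    "hr500k.conll.zip": 55.87,
}

# Average corpus densities derived from the counts.
tokens_per_sentence = TOKENS / SENTENCES
tokens_per_text = TOKENS / TEXTS

# Assumption: the page reports binary megabytes (1 MB = 2**20 bytes).
total_mib = TOTAL_BYTES / 2**20
archive_sum = sum(ARCHIVE_MB.values())

print(f"tokens per sentence: {tokens_per_sentence:.2f}")  # 20.57
print(f"tokens per text:     {tokens_per_text:.2f}")      # 562.73
print(f"files.size in MB:    {total_mib:.2f}")            # 91.53
print(f"sum of archives:     {archive_sum:.2f}")          # 91.53
```

Under that assumption, files.size and the three listed archive sizes both come to 91.53 MB, matching the total shown for downloading all files.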


 Files in this item

 Download all files in item (91.53 MB)
This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name hr500k.TEI.zip
Size 28.64 MB
Format application/zip
Description Corpus in TEI format
MD5 2e7671953af5fb9c40168094a63e2e89
File Preview
  • hr500k.TEI
    • hr500k.back.xml (81 kB)
    • hr500k.body.xml (88 MB)
    • hr500k.xml (18 kB)
    • TEI-schema
      • tei_clarin_schema.xml (2 kB)
      • tei_clarin_example.xml (31 kB)
      • README.md (442 B)
      • .git
      • schema
        • tei_clarin.zip (47 kB)
        • tei_clarin.rnc (206 kB)
        • tei_clarin.dtd (167 kB)
        • tei_clarin.rng (424 kB)
      • doc
        • tei_clarin_doc.xml (3 MB)
        • tei_clarin_doc.html (2 MB)
        • tei_clarin_doc.docx (698 kB)
        • tei_clarin_doc.pdf (5 MB)
    • 00README.txt (203 B)

Name hr500k.vert.zip
Size 7.02 MB
Format application/zip
Description Corpus in derived vertical format
MD5 fbb28589ad8b9d9bf7ed23861e7340af
File Preview
  • hr500k.vert
    • hr500k.vert (55 MB)
    • hr500k.regi (3 kB)
    • 00README.txt (203 B)

Name hr500k.conll.zip
Size 55.87 MB
Format application/zip
Description Source CoNLL-like format from GitHub (commit 9e5f556)
MD5 dd35a86769338e710ac43f7c60dde089
File Preview
  • hr500k.conll
    • MSD_final.pdf (115 kB)
    • README.md (8 B)
    • msd_mapper.py (6 kB)
    • .git
    • hr500k.conll (73 MB)
    • 00README.txt (203 B)
    • mte5-udv2.mapping (122 kB)
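
The MD5 values listed for the three archives can be used to check a download for corruption. The following is a minimal sketch, not a tool provided by the repository: the function names `md5_of_file` and `check_downloads` are hypothetical, and it assumes the archives were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file entries of this record.
EXPECTED_MD5 = {
    "hr500k.TEI.zip": "2e7671953af5fb9c40168094a63e2e89",
    "hr500k.vert.zip": "fbb28589ad8b9d9bf7ed23861e7340af",
    "hr500k.conll.zip": "dd35a86769338e710ac43f7c60dde089",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected archive name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, status in check_downloads().items():
        print(f"{name}: {status}")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.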
