dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.contributor.author Zupan, Katja
dc.contributor.author Gantar, Polona
dc.contributor.author Kuzman, Taja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Kavčič, Teja
dc.contributor.author Škrjanec, Iza
dc.contributor.author Marko, Dafne
dc.contributor.author Jezeršek, Lucija
dc.contributor.author Zajc, Anja
dc.date.accessioned 2019-01-26T20:37:28Z
dc.date.available 2019-01-26T20:37:28Z
dc.date.issued 2019-01-26
dc.identifier.uri http://hdl.handle.net/11356/1210
dc.description The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is annotated with semantic role labels. The morphosyntactic tags and syntactic dependencies are provided in both the JOS/MULTEXT-East framework and the Universal Dependencies framework. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, http://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (5) the guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) and (2) is provided in the back element, and that of (3), (4), and (5) in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. Compared with the previous version, 2.1, this version corrects various errata in spacing and text metadata and adds UD morphological annotations and, where they could be derived automatically, UD dependency annotations to the corpus. Note that the UD annotations are not included in the vertical file.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby http://dx.doi.org/10.18653/v1/W17-1406
dc.relation.replaces http://hdl.handle.net/11356/1181
dc.relation.isreplacedby http://hdl.handle.net/11356/1434
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.source.uri http://eng.slovenscina.eu/tehnologije/ucni-korpus
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.subject verbal multiword expressions
dc.subject semantic role labelling
dc.subject CONLL-U
dc.title Training corpus ssj500k 2.2
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden false
hidden hidden
hasMetadata false
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info 586248 tokens
size.info 27829 sentences
size.info 500295 words
files.count 4
files.size 42941421
featuredService.kontext search|https://www.clarin.si/kontext/first_form?corpname=ssj500k22
featuredService.noske search|https://www.clarin.si/noske/run.cgi/corp_info?corpname=ssj500k22&struct_attr_stats=1&subcorpora=1


 Files in this item

Name ssj500k.conllu.zip
Size 10 MB
Format application/zip
Description Corpus in CONLL-U format, complete corpus with UD morphology and separately the UD syntactically annotated part, also split into train/dev/test.
MD5 f65ae2995a2a7acfe43b1a5aa3140dca
Contents
  • ssj500k.conllu
    • ssj500k-ud-morphology.conllu (38 MB)
    • sl_ssj-ud_v2.4-dev.conllu (1 MB)
    • sl_ssj-ud_v2.4.conllu (11 MB)
    • sl_ssj-ud_v2.4-train.conllu (9 MB)
    • sl_ssj-ud_v2.4-test.conllu (1 MB)
    • 00README.txt (147 B)

Name ssj500k-en.TEI.zip
Size 11.92 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in English
MD5 2c5bb4d729bb03dbc2d88d8358196cfa
Contents
  • ssj500k-en.TEI
    • ssj500k.back.xml (552 kB)
    • ssj500k-en.xml (51 kB)
    • schema
      • tei_clarin_doc.xml (7 MB)
      • tei_clarin.zip (87 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin_example.xml (32 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • 00README.txt (147 B)
    • ssj500k-en.body.xml (98 MB)

Name ssj500k-sl.TEI.zip
Size 11.92 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in Slovene
MD5 da8d2116b54be5d26ec675e8bb5fc996
Contents
  • ssj500k-sl.TEI
    • ssj500k-sl.xml (51 kB)
    • ssj500k-sl.body.xml (98 MB)
    • ssj500k.back.xml (552 kB)
    • schema
      • tei_clarin_doc.xml (7 MB)
      • tei_clarin.zip (87 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin_example.xml (32 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • 00README.txt (147 B)

Name ssj500k.vert.zip
Size 7.12 MB
Format application/zip
Description Corpus in derived vertical (Sketch Engine / CQP) format
MD5 4c30a74912329a5252f942829f0f4a79
Contents
  • ssj500k.vert
    • ssj500k22.vert (44 MB)
    • ssj500k22.regi (4 kB)
    • 00README.txt (147 B)
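
The MD5 values listed for the four archives can be used to check a download for corruption. The following is a minimal sketch, not part of the record itself: it assumes the archives have been saved under their listed names in a single directory, and the names EXPECTED_MD5, md5_of_file, and check_downloads are hypothetical helpers introduced only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "ssj500k.conllu.zip": "f65ae2995a2a7acfe43b1a5aa3140dca",
    "ssj500k-en.TEI.zip": "2c5bb4d729bb03dbc2d88d8358196cfa",
    "ssj500k-sl.TEI.zip": "da8d2116b54be5d26ec675e8bb5fc996",
    "ssj500k.vert.zip": "4c30a74912329a5252f942829f0f4a79",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each archive name to True if it exists and its MD5 matches the record."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    for name, ok in check_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

An archive that is absent is reported the same way as one whose checksum differs, so a partial download of the item still produces a complete report.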
