
 
dc.contributor.author Krek, Simon
dc.contributor.author Dobrovoljc, Kaja
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Može, Sara
dc.contributor.author Ledinek, Nina
dc.contributor.author Holz, Nanika
dc.contributor.author Zupan, Katja
dc.contributor.author Gantar, Polona
dc.contributor.author Kuzman, Taja
dc.contributor.author Čibej, Jaka
dc.contributor.author Arhar Holdt, Špela
dc.contributor.author Kavčič, Teja
dc.contributor.author Škrjanec, Iza
dc.contributor.author Marko, Dafne
dc.contributor.author Jezeršek, Lucija
dc.contributor.author Zajc, Anja
dc.date.accessioned 2021-07-07T15:28:52Z
dc.date.available 2021-07-07T15:28:52Z
dc.date.issued 2021-07-07
dc.identifier.uri http://hdl.handle.net/11356/1434
dc.description The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is also annotated with semantic role labels. The morphosyntactic tags and syntactic dependencies are included both in the JOS/MULTEXT-East framework and in the framework of Universal Dependencies. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, http://nl.ijs.si/ME/V6/msd/, (2) the JOS dependency schema, http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/, (4) the Janes annotation guidelines for Slovenian named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/. The vocabulary of (1) and (2) is provided in the back element, and that of (3), (4), and (5) in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. In contrast to the previous version 2.2, this version includes the corrected Universal Dependencies relations from UD version 2.8, updates the TEI encoding, and adds UD annotations to the vertical file.
dc.language.iso slv
dc.publisher Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.isreferencedby http://dx.doi.org/10.18653/v1/W17-1406
dc.relation.replaces http://hdl.handle.net/11356/1210
dc.relation.isreplacedby http://hdl.handle.net/11356/1747
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.subject part-of-speech tagging
dc.subject dependency treebank
dc.subject parsing
dc.subject named entities
dc.subject tokenisation
dc.subject manual annotation
dc.subject TEI
dc.subject verbal multiword expressions
dc.subject semantic role labelling
dc.subject CONLL-U
dc.title Training corpus ssj500k 2.3
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
hidden false
hasMetadata false
has.files yes
branding CLARIN.SI data & tools
contact.person Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info 586248 tokens
size.info 27829 sentences
size.info 500295 words
files.count 4
files.size 44929526
featuredService.kontext search|https://www.clarin.si/kontext/first_form?corpname=ssj500k23
featuredService.noske search|https://www.clarin.si/ske/#dashboard?corpname=ssj500k23&struct_attr_stats=1


 Files in this item

 Download all files in item (42.85 MB)
Name ssj500k.conllu.zip
Size 8.55 MB
Format application/zip
Description Corpus in CONLL-U format: complete corpus with UD morphology and separately the UD syntactically annotated part split into train/dev/test
MD5 ebfd53684457bb4651c13b9bebb45423
File Preview
  • ssj500k.conllu
    • sl_ssj-ud-dev.conllu (1 MB)
    • ssj500k-morpho.conllu (42 MB)
    • sl_ssj-ud-test.conllu (1 MB)
    • 00README.txt (148 B)
    • sl_ssj-ud-train.conllu (9 MB)
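The CONLL-U files in this archive can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the corpus distribution: it assumes the general CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, a blank line between sentences), and the function names `read_conllu` and `count_sentences_and_words` are hypothetical helpers introduced only for illustration.

```python
# Minimal CoNLL-U reader (illustrative sketch, standard library only).
# Column names follow the CoNLL-U convention; the helper names are hypothetical.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def count_sentences_and_words(lines):
    """Return (sentences, syntactic words); range and empty-node IDs are ignored."""
    n_sentences = 0
    n_words = 0
    for sentence in read_conllu(lines):
        n_sentences += 1
        # Plain integer IDs are syntactic words; "1-2" and "1.1" are not.
        n_words += sum(1 for token in sentence if token["id"].isdigit())
    return n_sentences, n_words
```

Assuming the archive has been unpacked into the current directory with the layout shown in the file preview, the training split could be inspected like this:

```python
with open("ssj500k.conllu/sl_ssj-ud-train.conllu", encoding="utf-8") as handle:
    print(count_sentences_and_words(handle))
```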
Name ssj500k-en.TEI.zip
Size 12.42 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in English
MD5 735165b029f6f739082c15e74ee7a7da
File Preview
  • ssj500k-en.TEI
    • ssj500k.back.xml (500 kB)
    • ssj500k-en.xml (50 kB)
    • schema
      • tei_clarin_doc.xml (7 MB)
      • tei_clarin.zip (87 kB)
      • tei_clarin_example.xml (32 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • 00README.txt (148 B)
    • ssj500k-en.body.xml (101 MB)
Name ssj500k-sl.TEI.zip
Size 12.42 MB
Format application/zip
Description Corpus encoded in TEI format with annotations in Slovene
MD5 8d15e43fba438b2dcf328e7453efcd0a
File Preview
  • ssj500k-sl.TEI
    • ssj500k-sl.xml (50 kB)
    • ssj500k-sl.body.xml (101 MB)
    • ssj500k.back.xml (500 kB)
    • schema
      • tei_clarin_doc.xml (7 MB)
      • tei_clarin.zip (87 kB)
      • tei_clarin_example.xml (32 kB)
      • tei_clarin.rnc (282 kB)
      • tei_clarin_schema.xml (3 kB)
      • tei_clarin.dtd (229 kB)
      • tei_clarin_doc.html (7 MB)
      • tei_clarin.rng (579 kB)
    • 00README.txt (148 B)
Name ssj500k.vert.zip
Size 9.46 MB
Format application/zip
Description Corpus in derived vertical (Sketch Engine / CQP) format
MD5 8e7c003641c19be0c107b6043eb7f81a
File Preview
  • ssj500k.vert
    • ssj500k23.vert (89 MB)
    • ssj500k23.regi (4 kB)
    • 00README.txt (148 B)
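
The MD5 values listed for the four archives can be used to check downloads for corruption. The sketch below is one possible way to do so with Python's standard `hashlib` module; the checksums are copied from the file listing in this record, while the function names `md5_of` and `verify` and the assumption that the archives sit in one local directory under their listed names are illustrative only.

```python
import hashlib
from pathlib import Path

# MD5 checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "ssj500k.conllu.zip": "ebfd53684457bb4651c13b9bebb45423",
    "ssj500k-en.TEI.zip": "735165b029f6f739082c15e74ee7a7da",
    "ssj500k-sl.TEI.zip": "8d15e43fba438b2dcf328e7453efcd0a",
    "ssj500k.vert.zip": "8e7c003641c19be0c107b6043eb7f81a",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each archive name to True if it exists and its MD5 matches, else False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results
```

Calling `verify("downloads")` (a hypothetical directory name) returns a dictionary in which any missing or mismatching archive is marked `False`, so such a file can be downloaded again.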
