dc.contributor.author | Arhar Holdt, Špela |
dc.contributor.author | Krek, Simon |
dc.contributor.author | Dobrovoljc, Kaja |
dc.contributor.author | Erjavec, Tomaž |
dc.contributor.author | Gantar, Polona |
dc.contributor.author | Čibej, Jaka |
dc.contributor.author | Pori, Eva |
dc.contributor.author | Terčon, Luka |
dc.contributor.author | Munda, Tina |
dc.contributor.author | Žitnik, Slavko |
dc.contributor.author | Robida, Nejc |
dc.contributor.author | Blagus, Neli |
dc.contributor.author | Može, Sara |
dc.contributor.author | Ledinek, Nina |
dc.contributor.author | Holz, Nanika |
dc.contributor.author | Zupan, Katja |
dc.contributor.author | Kuzman, Taja |
dc.contributor.author | Kavčič, Teja |
dc.contributor.author | Škrjanec, Iza |
dc.contributor.author | Marko, Dafne |
dc.contributor.author | Jezeršek, Lucija |
dc.contributor.author | Zajc, Anja |
dc.date.accessioned | 2022-12-05T11:47:15Z |
dc.date.available | 2022-12-05T11:47:15Z |
dc.date.issued | 2022-12-05 |
dc.identifier.uri | http://hdl.handle.net/11356/1747 |
dc.description | The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with some parts also containing further manually verified annotations. The morphosyntactic tags and (where present) syntactic dependencies are provided both in the JOS/MULTEXT-East framework and in the Universal Dependencies framework. The corpus is composed of several parts:
* ssj500k-syn (200,320 words): the syntactically annotated part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains named entity, verbal multiword expression and semantic role label annotations;
* ssj500k-tag.xml (299,927 words): the PoS-tagged part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains verbal multiword expression annotations;
* Ambiga (13,929 words): a corpus constructed to contain many potentially lemma/PoS-ambiguous words, to help in the training of taggers and lemmatizers;
* ElexisWSD (27,091 words): the Slovenian part of the "Parallel sense-annotated corpus ELEXIS-WSD 1.0" (http://hdl.handle.net/11356/1674) with manually checked lemmatisation, PoS tagging, and syntactic parses; also contains named entity and semantic role label annotations;
* SentiCoref (340,401 words): the "Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0" (http://hdl.handle.net/11356/1285) with manually checked lemmatisation and PoS tagging; also contains named entity and coreference chain annotations.
The annotations follow:
(1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V6/msd/,
(2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf,
(3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/,
(4) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf,
(5) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/.
The vocabulary of (1) is provided in the back element, and (3)-(5) are provided as taxonomies in the teiHeader of the TEI-encoded corpus. The semantic role labels are also documented in the teiHeader. Compared with the previous version, ssj500k 2.3, this version contains significantly more text, corrects various annotation errors, annotates more text with syntactic parses, adds new types of annotation, updates the TEI encoding, provides CoNLL-U files with text metadata, and distinguishes UD-type CoNLL-U files from JOS-type CoNLL-U files. |
dc.language.iso | slv |
dc.publisher | Centre for Language Resources and Technologies, University of Ljubljana |
dc.relation.replaces | http://hdl.handle.net/11356/1434 |
dc.relation.isreplacedby | http://hdl.handle.net/11356/1959 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | https://rsdo.slovenscina.eu/en/language-resources |
dc.subject | part-of-speech tagging |
dc.subject | dependency treebank |
dc.subject | parsing |
dc.subject | named entities |
dc.subject | tokenisation |
dc.subject | manual annotation |
dc.subject | TEI |
dc.subject | verbal multiword expressions |
dc.subject | semantic role labelling |
dc.subject | CoNLL-U |
dc.subject | coreference resolution |
dc.title | Training corpus SUK 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Simon Krek; simon.krek@guest.arnes.si; Jožef Stefan Institute |
contact.person | Tomaž Erjavec; tomaz.erjavec@ijs.si; Jožef Stefan Institute |
contact.person | Špela Arhar Holdt; Spela.ArharHoldt@ff.uni-lj.si; Centre for Language Resources and Technologies, University of Ljubljana |
sponsor | Ministry of Education, Science and Sport; 3311-08-986003; Communication in Slovene; Other |
sponsor | ARRS (Slovenian Research Agency); P2-103; Knowledge Technologies; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); J6-8256; New grammar of contemporary standard Slovene: sources and methods; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); MR-37487; Young Researcher Programme; nationalFunds |
sponsor | ARRS (Slovenian Research Agency); P6-0411; Language Resources and Technologies for Slovene; nationalFunds |
sponsor | Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other |
size.info | 2913 texts |
size.info | 48594 sentences |
size.info | 881668 words |
size.info | 1025639 tokens |
files.count | 2 |
files.size | 45235424 |
featuredService.kontext | search|https://www.clarin.si/kontext/query?corpname=suk10 |
featuredService.noske | search|https://www.clarin.si/ske/#dashboard?corpname=suk10 |
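
The per-part word counts given in dc.description can be cross-checked against the word total in size.info. The following is a minimal Python sketch of that check; the figures are copied from this record, while the variable names are illustrative only.

```python
# Word counts per corpus part, copied from the dc.description field.
part_words = {
    "ssj500k-syn": 200_320,
    "ssj500k-tag": 299_927,
    "Ambiga": 13_929,
    "ElexisWSD": 27_091,
    "SentiCoref": 340_401,
}

# Word total reported in the size.info field of this record.
reported_words = 881_668

total = sum(part_words.values())
print(total, total == reported_words)  # prints: 881668 True
```

The five parts add up exactly to the reported 881668 words; the 1025639 tokens in size.info are a separate count and are not broken down per part in this record.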
Files in this item

Name | SUK.TEI.zip |
Size | 19.94 MB |
Format | application/zip |
Description | Corpus in TEI format |
MD5 | d9777bc76a5db135b6e9ee3cf12c9e71 |

Name | SUK.CoNLL-U.zip |
Size | 23.2 MB |
Format | application/zip |
Description | Corpus in CoNLL-U format |
MD5 | 66b3db82d3356bbf80746d0b94e75d16 |
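
The MD5 values listed for the two archives can be used to verify a download. The sketch below uses only the Python standard library and assumes that both archives were saved under their listed names in the current directory; the helper names `md5_of` and `check_download` are hypothetical and not part of any CLARIN.SI tooling.

```python
import hashlib
from pathlib import Path

# MD5 values as listed for the two archives in this record.
EXPECTED_MD5 = {
    "SUK.TEI.zip": "d9777bc76a5db135b6e9ee3cf12c9e71",
    "SUK.CoNLL-U.zip": "66b3db82d3356bbf80746d0b94e75d16",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(directory="."):
    """Map each archive name to True, False, or None (file not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in check_download().items():
        print(f"{name}: {labels[ok]}")
```

A mismatch usually points to an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.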
- SUK.CoNLL-U
  - senticoref.ud.conllu (28 MB)
  - ambiga.jos.conllu (1 MB)
  - tei2conllu.xsl (26 kB)
  - ssj500k-syn.ud.conllu (17 MB)
  - ssj500k-tag.jos.conllu (31 MB)
  - elexiswsd.ud.conllu (2 MB)
  - ambiga.ud.conllu (1 MB)
  - ssj500k-tag.ud.conllu (23 MB)
  - senticoref.jos.conllu (36 MB)
  - 00README.txt (1 kB)
  - elexiswsd.jos.conllu (3 MB)
  - ssj500k-syn.jos.conllu (22 MB)
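
The .conllu files listed above can be read without special tooling. The following is a minimal sketch, assuming the files follow the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with `#`, a blank line after each sentence); this record does not confirm the exact column contents of the JOS-type files. The function name `conllu_stats` is hypothetical, and the sample sentence is invented for illustration, not taken from the corpus.

```python
from collections import Counter


def conllu_stats(lines):
    """Count sentences, word lines and UPOS tags in CoNLL-U formatted lines."""
    sentences = 0
    words = 0
    upos = Counter()
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level or document-level metadata comment
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        in_sentence = True
        words += 1
        upos[cols[3]] += 1  # column 4 is UPOS in standard CoNLL-U
    if in_sentence:
        sentences += 1  # file did not end with a blank line
    return sentences, words, upos


# Invented sample sentence (not from SUK); XPOS and FEATS are left unspecified.
SAMPLE = [
    "# sent_id = example.1",
    "# text = To je primer.",
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\tprimer\tprimer\tNOUN\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]

sentences, words, upos = conllu_stats(SAMPLE)
print(sentences, words, dict(upos))
```

To run it on the corpus, pass an open file instead of the sample, for example `conllu_stats(open("ssj500k-syn.ud.conllu", encoding="utf-8"))`. The resulting counts need not match the size.info figures, since this record does not state how words and tokens were counted.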