Training corpus SUK 1.0

Name: Training corpus SUK 1.0
License: https://creativecommons.org/licenses/by-sa/4.0/

Arhar Holdt, Špela; Krek, Simon; Dobrovoljc, Kaja; Erjavec, Tomaž; Gantar, Polona; Čibej, Jaka; Pori, Eva; Terčon, Luka; Munda, Tina; Žitnik, Slavko; Robida, Nejc; Blagus, Neli; Može, Sara; Ledinek, Nina; Holz, Nanika; Zupan, Katja; Kuzman, Taja; Kavčič, Teja; Škrjanec, Iza; Marko, Dafne; Jezeršek, Lucija; Zajc, Anja

dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Krek, Simon
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Erjavec, Tomaž
dc.contributor.author	Gantar, Polona
dc.contributor.author	Čibej, Jaka
dc.contributor.author	Pori, Eva
dc.contributor.author	Terčon, Luka
dc.contributor.author	Munda, Tina
dc.contributor.author	Žitnik, Slavko
dc.contributor.author	Robida, Nejc
dc.contributor.author	Blagus, Neli
dc.contributor.author	Može, Sara
dc.contributor.author	Ledinek, Nina
dc.contributor.author	Holz, Nanika
dc.contributor.author	Zupan, Katja
dc.contributor.author	Kuzman, Taja
dc.contributor.author	Kavčič, Teja
dc.contributor.author	Škrjanec, Iza
dc.contributor.author	Marko, Dafne
dc.contributor.author	Jezeršek, Lucija
dc.contributor.author	Zajc, Anja
dc.date.accessioned	2022-12-05T11:47:15Z
dc.date.available	2022-12-05T11:47:15Z
dc.date.issued	2022-12-05
dc.identifier.uri	http://hdl.handle.net/11356/1747
dc.description	The SUK training corpus contains about 1 million tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation, with some parts also containing further manually verified annotations. The morphosyntactic tags and (where present) syntactic dependencies are included both in the JOS/MULTEXT-East framework and in the framework of Universal Dependencies. The corpus is composed of several parts:
* ssj500k-syn (200,320 words): the syntactically annotated part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains named entity, verbal multiword expression, and semantic role label annotations;
* ssj500k-tag.xml (299,927 words): the PoS-tagged part of the updated ssj500k corpus 2.3 (http://hdl.handle.net/11356/1434); also contains verbal multiword expression annotations;
* Ambiga (13,929 words): a corpus constructed to contain many words that are potentially ambiguous in lemma or PoS, in order to help in the training of taggers and lemmatizers;
* ElexisWSD (27,091 words): the Slovenian part of the "Parallel sense-annotated corpus ELEXIS-WSD 1.0" (http://hdl.handle.net/11356/1674) with manually checked lemmatisation, PoS tagging, and syntactic parses; also contains named entity and semantic role label annotations;
* SentiCoref (340,401 words): the "Slovene corpus for aspect-based sentiment analysis - SentiCoref 1.0" (http://hdl.handle.net/11356/1285) with manually checked lemmatisation and PoS tagging; also contains named entity and coreference chain annotations.
The annotations follow:
(1) the MULTEXT-East V6 morphosyntactic specifications for Slovene, https://nl.ijs.si/ME/V6/msd/;
(2) the JOS dependency schema, https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf;
(3) the Universal Dependencies morphosyntactic specifications and syntactic dependencies for Slovene-SSJ, https://universaldependencies.org/;
(4) the Janes annotation guidelines for Slovenian named entities, https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf;
(5) the guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/.
In the TEI-encoded corpus, the vocabulary of (1) is provided in the back element, and the vocabularies of (3)-(5) are provided as taxonomies in the teiHeader. The semantic role labels are also documented in the teiHeader.
Compared with the previous version, ssj500k 2.3, this version has significantly more text, corrects various annotation errors, annotates more text with syntactic parses, adds new types of annotation, updates the TEI encoding, provides CoNLL-U files with text metadata, and distinguishes UD-type CoNLL-U files from JOS-type CoNLL-U files.
dc.language.iso	slv
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.relation.replaces	http://hdl.handle.net/11356/1434
dc.relation.isreplacedby	http://hdl.handle.net/11356/1959
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	https://rsdo.slovenscina.eu/en/language-resources
dc.subject	part-of-speech tagging
dc.subject	dependency treebank
dc.subject	parsing
dc.subject	named entities
dc.subject	tokenisation
dc.subject	manual annotation
dc.subject	TEI
dc.subject	verbal multiword expressions
dc.subject	semantic role labelling
dc.subject	CONLL-U
dc.subject	coreference resolution
dc.title	Training corpus SUK 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Simon Krek simon.krek@guest.arnes.si Jožef Stefan Institute
contact.person	Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
contact.person	Špela Arhar Holdt Spela.ArharHoldt@ff.uni-lj.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	Ministry of Education, Science and Sport 3311-08-986003 Communication in Slovene Other
sponsor	ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor	ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor	ARRS (Slovenian Research Agency) MR-37487 Young Researcher Programme nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
size.info	2913 texts
size.info	48594 sentences
size.info	881668 words
size.info	1025639 tokens
files.count	2
files.size	45235424
featuredService.kontext	search|https://www.clarin.si/kontext/query?corpname=suk10
featuredService.noske	search|https://www.clarin.si/ske/#dashboard?corpname=suk10
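
The per-part word counts given in the description can be cross-checked against the size.info total of 881668 words. The snippet below is only a small arithmetic sketch; the variable names are illustrative and not part of the record.

```python
# Word counts per corpus part, as listed in dc.description.
PART_WORDS = {
    "ssj500k-syn": 200_320,
    "ssj500k-tag": 299_927,
    "Ambiga": 13_929,
    "ElexisWSD": 27_091,
    "SentiCoref": 340_401,
}

# Total word count reported in size.info.
REPORTED_TOTAL_WORDS = 881_668

total = sum(PART_WORDS.values())
print(total, total == REPORTED_TOTAL_WORDS)  # 881668 True
```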

Files in this item

Download all files in item (43.14 MB)

This item is publicly available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: SUK.TEI.zip
Size: 19.94 MB
Format: application/zip
Description: Corpus in TEI format
MD5: d9777bc76a5db135b6e9ee3cf12c9e71

SUK.TEI
- SUK.xml (65 kB)
- senticoref.xml (62 MB)
- schema
  - tei_clarin.zip (87 kB)
  - tei_clarin.rnc (282 kB)
  - tei_clarin_schema.xml (3 kB)
  - tei_clarin.dtd (229 kB)
  - tei_clarin_doc.html (7 MB)
  - tei_clarin.rng (579 kB)
- ssj500k-tag.xml (42 MB)
- elexiswsd.xml (11 MB)
- ssj500k-syn.xml (69 MB)
- mte-fvlib.xml (1 MB)
- 00README.txt (1 kB)
- ambiga.xml (2 MB)

Name: SUK.CoNLL-U.zip
Size: 23.2 MB
Format: application/zip
Description: Corpus in CoNLL-U format
MD5: 66b3db82d3356bbf80746d0b94e75d16
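
After downloading, the archives can be checked against the MD5 values published in this record. The following is a minimal sketch using only the Python standard library; the helper names `md5_of` and `verify` are illustrative, and it assumes the downloaded files keep their original names.

```python
import hashlib
from pathlib import Path

# MD5 checksums as published in this record.
EXPECTED_MD5 = {
    "SUK.TEI.zip": "d9777bc76a5db135b6e9ee3cf12c9e71",
    "SUK.CoNLL-U.zip": "66b3db82d3356bbf80746d0b94e75d16",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path):
    """Return True if the file's MD5 matches the published value for its name."""
    path = Path(path)
    return md5_of(path) == EXPECTED_MD5[path.name]


# Example (assumes the archive sits in the current directory):
# print(verify("SUK.CoNLL-U.zip"))
```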

SUK.CoNLL-U
- senticoref.ud.conllu (28 MB)
- ambiga.jos.conllu (1 MB)
- tei2conllu.xsl (26 kB)
- ssj500k-syn.ud.conllu (17 MB)
- ssj500k-tag.jos.conllu (31 MB)
- elexiswsd.ud.conllu (2 MB)
- ambiga.ud.conllu (1 MB)
- ssj500k-tag.ud.conllu (23 MB)
- senticoref.jos.conllu (36 MB)
- 00README.txt (1 kB)
- elexiswsd.jos.conllu (3 MB)
- ssj500k-syn.jos.conllu (22 MB)
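
The .conllu files listed above can be read with a few lines of code. The sketch below assumes they follow the standard 10-column, tab-separated CoNLL-U layout, with `#` comment lines carrying sentence metadata and blank lines separating sentences. The function name `read_conllu` and the sample sentence are invented for illustration and are not taken from the corpus; how the JOS-type files fill the individual columns should be checked in 00README.txt.

```python
import io

# Standard CoNLL-U column names.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield (comments, tokens) for each sentence from an iterable of lines."""
    comments, tokens = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        # Last sentence when the input lacks a trailing blank line.
        yield comments, tokens


# Invented sample, NOT from the corpus; unknown columns are left as "_".
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = To je primer.\n"
    "1\tTo\tta\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tje\tbiti\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tprimer\tprimer\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for comments, tokens in sentences:
    print(comments[0], [t["form"] for t in tokens])
```

For a real file, the same function can be applied to an open file handle, for example `open("ssj500k-syn.ud.conllu", encoding="utf-8")` once the archive has been unpacked.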
