dc.contributor.author |
Erjavec, Tomaž |
dc.contributor.author |
Fišer, Darja |
dc.contributor.author |
Ljubešić, Nikola |
dc.contributor.author |
Ferme, Marko |
dc.contributor.author |
Borovič, Mladen |
dc.contributor.author |
Boškovič, Borko |
dc.contributor.author |
Ojsteršek, Milan |
dc.contributor.author |
Hrovat, Goran |
dc.date.accessioned |
2019-12-24T14:43:31Z |
dc.date.available |
2019-12-24T14:43:31Z |
dc.date.issued |
2019-11-28 |
dc.identifier.uri |
http://hdl.handle.net/11356/1244 |
dc.description |
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.7 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/).
Each thesis has significant associated metadata, and the corpus contains only its textual body, i.e. without front and back matter. The body is divided into pages, pages into paragraphs, and paragraphs into sentences. The sentence tokens are morphosyntactically annotated, words are lemmatised, and English-Slovene pairs of term candidates are marked up and linked. The PhD theses in the corpus also have Slovene monolingual term candidates marked up.
The corpus is distributed in the canonical TEI encoding, in the so-called vertical format used by the (no)Sketch Engine and CWB concordancers, and as plain text files. Each distribution format also includes a file with the thesis metadata.
This repository entry contains the complete corpus; separate entries contain only the PhD theses (KAS-dr: http://hdl.handle.net/11356/1265), the MSc/MA theses (KAS-mag: http://hdl.handle.net/11356/1266) and the BSc/BA theses (KAS-dipl: http://hdl.handle.net/11356/1267). |
dc.language.iso |
slv |
dc.publisher |
Jožef Stefan Institute |
dc.publisher |
Faculty of Electrical Engineering and Computer Science, University of Maribor |
dc.relation.isreferencedby |
https://rdcu.be/b7GrB |
dc.relation.isreplacedby |
http://hdl.handle.net/11356/1448 |
dc.rights |
CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0 |
dc.rights.uri |
https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0 |
dc.rights.label |
ACA |
dc.source.uri |
http://nl.ijs.si/kas/ |
dc.subject |
PhD theses |
dc.subject |
MSc/MA theses |
dc.subject |
BSc/BA theses |
dc.subject |
academic writing |
dc.subject |
terminology |
dc.subject |
TEI |
dc.subject |
scientific texts |
dc.title |
Corpus of academic Slovene KAS 1.0 |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
has.files |
yes |
branding |
CLARIN.SI data & tools |
contact.person |
Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute |
sponsor |
ARRS (Slovenian Research Agency) J6-7094 Slovene scientific texts: resources and description nationalFunds |
size.info |
82308 texts |
size.info |
5048551 pages |
size.info |
1699097710 tokens |
files.count |
6 |
files.size |
45217492350 |
featuredService.kontext |
search KAS|https://www.clarin.si/kontext/first_form?corpname=kas |
featuredService.kontext |
search KAS-dipl|https://www.clarin.si/kontext/first_form?corpname=kas_dipl |
featuredService.kontext |
search KAS-mag|https://www.clarin.si/kontext/first_form?corpname=kas_mag |
featuredService.kontext |
search KAS-dr|https://www.clarin.si/kontext/first_form?corpname=kas_dr |
featuredService.noske |
search KAS|https://www.clarin.si/ske/#dashboard?corpname=kas |
featuredService.noske |
search KAS-dipl|https://www.clarin.si/ske/#dashboard?corpname=kas_dipl |
featuredService.noske |
search KAS-mag|https://www.clarin.si/ske/#dashboard?corpname=kas_mag |
featuredService.noske |
search KAS-dr|https://www.clarin.si/ske/#dashboard?corpname=kas_dr |
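
The description above mentions a vertical-format distribution in which the text is split into sentences whose tokens carry a morphosyntactic tag and a lemma. As a rough illustration only, the following Python sketch reads such a file sentence by sentence. The element names (`<text>`, `<p>`, `<s>`), the column order (word, tag, lemma), the tag values and the sample data are all assumptions made for this example; they are not taken from the KAS distribution, whose actual layout should be checked in the files themselves.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]


def is_structural(line: str) -> bool:
    """True for structure lines such as <p> or </s>.

    A token that is itself "<" sits on a line with tab-separated
    annotations, so lines containing a tab are treated as tokens.
    """
    return line.startswith("<") and line.endswith(">") and "\t" not in line


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token tuples, one per token line."""
    sentence = None
    for raw in lines:
        line = raw.rstrip("\n")
        if is_structural(line):
            if line == "<s>" or line.startswith("<s "):
                sentence = []          # a new sentence opens
            elif line == "</s>":
                if sentence is not None:
                    yield sentence     # the sentence is complete
                sentence = None
            # other structural lines (text, page, paragraph) are skipped
        elif sentence is not None and line:
            sentence.append(tuple(line.split("\t")))


# Hypothetical sample: columns assumed to be word, tag, lemma.
SAMPLE = (
    '<text id="hypothetical-1">\n'
    "<p>\n"
    "<s>\n"
    "To\tP\tta\n"
    "je\tV\tbiti\n"
    "test\tN\ttest\n"
    ".\tZ\t.\n"
    "</s>\n"
    "</p>\n"
    "</text>\n"
)

sentences = list(iter_sentences(SAMPLE.splitlines()))
lemmas = [token[2] for token in sentences[0]]
```

For a real file, passing an open file handle to `iter_sentences` streams the corpus without loading it into memory, which matters for a distribution of this size.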