
 
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Ferme, Marko
dc.contributor.author Borovič, Mladen
dc.contributor.author Boškovič, Borko
dc.contributor.author Ojsteršek, Milan
dc.contributor.author Hrovat, Goran
dc.date.accessioned 2019-12-24T14:43:31Z
dc.date.available 2019-12-24T14:43:31Z
dc.date.issued 2019-11-28
dc.identifier.uri http://hdl.handle.net/11356/1244
dc.description The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.7 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). The theses come with substantial associated metadata, and each thesis in the corpus contains only its textual body, i.e. without its front and back matter. The body is divided into pages, the pages into paragraphs, and the paragraphs into sentences. The sentence tokens are morphosyntactically annotated, words are lemmatised, and English-Slovene pairs of term candidates are marked up and linked. The PhD theses in the corpus also have marked-up Slovene monolingual term candidates. The corpus is distributed in the canonical TEI encoding, in the so-called vertical format used by the (no)Sketch Engine and CWB concordancers, and as plain text files. Each format distribution also contains a file with thesis metadata. This repository entry contains the complete corpus; separate entries are available that contain only the PhD theses (KAS-dr: http://hdl.handle.net/11356/1265), the MSc/MA theses (KAS-mag: http://hdl.handle.net/11356/1266) and the BSc/BA theses (KAS-dipl: http://hdl.handle.net/11356/1267).
dc.language.iso slv
dc.publisher Jožef Stefan Institute
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.relation.isreferencedby https://rdcu.be/b7GrB
dc.relation.isreplacedby http://hdl.handle.net/11356/1448
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri http://nl.ijs.si/kas/
dc.subject PhD theses
dc.subject MSc/MA theses
dc.subject BSc/BA theses
dc.subject academic writing
dc.subject terminology
dc.subject TEI
dc.title Corpus of academic Slovene KAS 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tomaž Erjavec tomaz.erjavec@ijs.si Jožef Stefan Institute
sponsor ARRS (Slovenian Research Agency) J6-7094 Slovene scientific texts: resources and description nationalFunds
size.info 82308 texts
size.info 5048551 pages
size.info 1699097710 tokens
files.count 6
files.size 45217492350
featuredService.kontext search KAS|https://www.clarin.si/kontext/first_form?corpname=kas
featuredService.kontext search KAS-dipl|https://www.clarin.si/kontext/first_form?corpname=kas_dipl
featuredService.kontext search KAS-mag|https://www.clarin.si/kontext/first_form?corpname=kas_mag
featuredService.kontext search KAS-dr|https://www.clarin.si/kontext/first_form?corpname=kas_dr
featuredService.noske search KAS|https://www.clarin.si/ske/#dashboard?corpname=kas
featuredService.noske search KAS-dipl|https://www.clarin.si/ske/#dashboard?corpname=kas_dipl
featuredService.noske search KAS-mag|https://www.clarin.si/ske/#dashboard?corpname=kas_mag
featuredService.noske search KAS-dr|https://www.clarin.si/ske/#dashboard?corpname=kas_dr
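
The description above mentions the vertical format used by the (no)Sketch Engine and CWB concordancers. As a rough illustration of how such a file might be read, the sketch below assumes the usual conventions of that format: one token per line with tab-separated attributes, structural tags such as <s> and </s> on their own lines, and <g/> as a glue marker. The column order (word, lemma, morphosyntactic tag), the sample sentence, the document id and the function name iter_sentences are assumptions made for illustration only; the actual attributes in the KAS distribution should be checked against the files themselves.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, ...]


def iter_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token tuples.

    Assumes vertical-format input: tokens are tab-separated attribute
    lines, and sentences are delimited by <s ...> and </s> tag lines.
    """
    sentence: List[Token] = []
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<s>") or line.startswith("<s "):
            sentence, in_sentence = [], True
        elif line.startswith("</s>"):
            if in_sentence:
                yield sentence
            sentence, in_sentence = [], False
        elif line.startswith("<") and line.endswith(">"):
            # Other structural tags (<doc>, <p>, <g/>, ...) carry no token.
            continue
        elif in_sentence:
            sentence.append(tuple(line.split("\t")))


# Made-up sample; the columns (word, lemma, tag) are an assumption.
sample = """<doc id="hypothetical-1">
<p>
<s>
To\tta\tPd-nsn
je\tbiti\tVa-r3s-n
primer\tprimer\tNcmsn
<g/>
.\t.\tZ
</s>
</p>
</doc>
"""

sentences = list(iter_sentences(sample.splitlines()))
words = [token[0] for token in sentences[0]]
```

For the real archives, the same generator could be fed lines from an opened file, so that a slice is processed as a stream rather than loaded into memory at once.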


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
Inform Before Use, Attribution Required, Noncommercial
Name               Size     Format            Description                                 MD5
kas.tei.tar.0.gz   6.31 GB  application/gzip  Corpus in TEI format, slice 0               83ba9ba74c717c610582c874701c33cb
kas.tei.tar.1.gz   6.31 GB  application/gzip  Corpus in TEI format, slice 1               3fee0c7ca0af3b48cbeba94852ac0f14
kas.tei.tar.2.gz   5.91 GB  application/gzip  Corpus in TEI format, slice 2               3d3ee0a2366da3560397b8121d4e9616
kas.vert.tar.0.gz  9.01 GB  application/gzip  Corpus in derived vertical format, slice 0  23eba94d471db62c8e04caf6c647d903
kas.vert.tar.1.gz  8.48 GB  application/gzip  Corpus in derived vertical format, slice 1  41e5711931a0a026048d82dd217cd3db
kas.txt.tar.gz     6.1 GB   application/gzip  Corpus in plain text format                 f6c07dc11f6a5c2d6e9d7bec8ef080af
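
Since each file is listed with an MD5 checksum, a download can be checked against it before unpacking. The following is a minimal sketch using only the Python standard library; the file names and checksums are copied from the listing, while the function names md5_of_file and verify_downloads and the idea of keeping all archives in one directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed for this repository entry.
EXPECTED_MD5 = {
    "kas.tei.tar.0.gz": "83ba9ba74c717c610582c874701c33cb",
    "kas.tei.tar.1.gz": "3fee0c7ca0af3b48cbeba94852ac0f14",
    "kas.tei.tar.2.gz": "3d3ee0a2366da3560397b8121d4e9616",
    "kas.vert.tar.0.gz": "23eba94d471db62c8e04caf6c647d903",
    "kas.vert.tar.1.gz": "41e5711931a0a026048d82dd217cd3db",
    "kas.txt.tar.gz": "f6c07dc11f6a5c2d6e9d7bec8ef080af",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks.

    Chunked reading keeps memory use flat for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each listed file name to True if present with a matching MD5."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling verify_downloads(".") in the download directory returns a dictionary of file names and booleans; a False value means the file is missing or its checksum differs from the listed one, in which case it is worth downloading that slice again.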
