
 
dc.contributor.author Žagar, Aleš
dc.contributor.author Kavaš, Matic
dc.contributor.author Robnik-Šikonja, Marko
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Ferme, Marko
dc.contributor.author Borovič, Mladen
dc.contributor.author Boškovič, Borko
dc.contributor.author Ojsteršek, Milan
dc.contributor.author Hrovat, Goran
dc.date.accessioned 2022-02-04T16:50:50Z
dc.date.available 2022-02-04T16:50:50Z
dc.date.issued 2022-02-04
dc.identifier.uri http://hdl.handle.net/11356/1448
dc.description The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). Each thesis is accompanied by substantial metadata, and the corpus contains only its textual body, i.e. without front and back matter. The body is divided into chapters, then pages, then paragraphs, and finally sentences. The tokens in each sentence are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. Compared with the previous version 1.0, the KAS corpus of Slovene academic writing 2.0 is cleaner and is segmented into chapters. The metadata also contains more information about the research fields of each work. Both versions contain the same number of BSc/BA, MSc/MA, and PhD theses; however, the processing was done from scratch for 2.0, so counts such as the number of pages and tokens differ. Note also that the new version contains neither links to the PNG images of individual pages nor annotated terms, both of which are present in version 1.0. Unlike 1.0, it is also not mounted on the CLARIN.SI concordancers. The new version is distributed in the canonical TEI encoding, in JSON, and as plain text files. In the TEI format, chapter names are marked with the <head> tag. Each entry in the JSON files has a string ID and a list containing the names of chapters as its first element and their texts as its second element. A chapter without text is represented as an empty string. The plain text files contain only the text bodies, without segmentation information.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
dc.language.iso slv
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://doi.org/10.5281/zenodo.5562228
dc.relation.replaces http://hdl.handle.net/11356/1244
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri https://nl.ijs.si/kas/
dc.subject PhD theses
dc.subject MSc/MA theses
dc.subject BSc/BA theses
dc.subject academic writing
dc.subject TEI
dc.title Corpus of academic Slovene KAS 2.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Aleš Žagar; Ales.Zagar@fri.uni-lj.si; Faculty of Computer and Information Science
sponsor Jožef Stefan Institute; CLARIN; CLARIN.SI; nationalFunds
sponsor Ministry of Culture; C3340-20-278001; Development of Slovene in a Digital Environment; Other
sponsor European Union; EC/H2020/825153; EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media; euFunds; info:eu-repo/grantAgreement/EC/H2020/825153
sponsor ARRS (Slovenian Research Agency); J6-2581; Računalniško podprta večjezična analiza novičarskega diskurza s kontekstualnimi besednimi vložitvami; nationalFunds
size.info 82308 texts
size.info 4780517 pages
size.info 1496079001 tokens
files.count 4
files.size 14716737837
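
The JSON layout mentioned in dc.description could be walked roughly as follows. This is only a sketch under assumptions: it supposes that a JSON file parses to an object mapping each string ID to a two-element list (chapter names, then chapter texts of equal length), which is one reading of the description and has not been checked against the actual files. The ID and chapter contents in the sample are hypothetical placeholders.

```python
import json

# Hypothetical sample that mimics the layout described in dc.description:
# {ID: [[chapter names], [chapter texts]]}. The ID and texts are made up.
SAMPLE_JSON = """
{"example-id-1": [["Uvod", "Metode", "Priloga"],
                  ["Besedilo uvoda.", "Besedilo metod.", ""]]}
"""

sample = json.loads(SAMPLE_JSON)


def iter_chapters(entries):
    """Yield (doc_id, chapter_name, chapter_text) for every chapter."""
    for doc_id, (names, texts) in entries.items():
        # Assumption: names and texts are parallel lists of equal length.
        for name, text in zip(names, texts):
            yield doc_id, name, text


def nonempty_chapters(entries):
    """Keep only chapters with text; empty chapters are empty strings."""
    return [row for row in iter_chapters(entries) if row[2] != ""]


for doc_id, name, text in nonempty_chapters(sample):
    print(doc_id, name, len(text))
```

If the real files nest the data differently (for example one entry per line), only the loading step and the unpacking in iter_chapters would need to change.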


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
Inform Before Use, Attribution Required, Noncommercial
Name             Size     Format            MD5                               Description
kas.tei.tar.gz   8.16 GB  application/gzip  0e450329c16cf2adaa04f3a94356a476  Corpus in source TEI format
kas.json.tar.gz  2.78 GB  application/gzip  1e2e2e88f16d40cdae79d56e1568efd1  Corpus in JSON format
kas-meta.tsv.gz  11.6 MB  application/gzip  a81810aa6737425c43d2a44cac349994  Per-document TSV metadata of the corpus
kas.txt.tar.gz   2.76 GB  application/gzip  f31b94c9d56e666375a9db76b692e7c9  Corpus in plain text format
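
Because the archives are several gigabytes each, it may be worth checking a download against the MD5 values listed for the files above. A minimal sketch using only the Python standard library follows; the checksums are copied from this record, while the function names and the assumption that files are saved under their original names are illustrative.

```python
import hashlib
import os

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "kas.tei.tar.gz": "0e450329c16cf2adaa04f3a94356a476",
    "kas.json.tar.gz": "1e2e2e88f16d40cdae79d56e1568efd1",
    "kas-meta.tsv.gz": "a81810aa6737425c43d2a44cac349994",
    "kas.txt.tar.gz": "f31b94c9d56e666375a9db76b692e7c9",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1 MiB at a time so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name.

    Assumes the file kept its original name; raises KeyError otherwise.
    """
    expected = EXPECTED_MD5[os.path.basename(path)]
    return md5_of_file(path) == expected
```

For example, verify_download("kas-meta.tsv.gz") would return True for an intact copy of the metadata file saved in the current directory.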
