
 
dc.contributor.author Žagar, Aleš
dc.contributor.author Kavaš, Matic
dc.contributor.author Robnik-Šikonja, Marko
dc.contributor.author Erjavec, Tomaž
dc.contributor.author Fišer, Darja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author Ferme, Marko
dc.contributor.author Borovič, Mladen
dc.contributor.author Boškovič, Borko
dc.contributor.author Ojsteršek, Milan
dc.contributor.author Hrovat, Goran
dc.date.accessioned 2022-02-10T09:35:07Z
dc.date.available 2022-02-10T09:35:07Z
dc.date.issued 2022-02-04
dc.identifier.uri http://hdl.handle.net/11356/1447
dc.description The Machine Translation datasets KAS-MT 1.0 contain automatically sentence-aligned Slovene and English plain-text abstracts from KAS-Abs 2.0 (http://hdl.handle.net/11356/1449) and are meant for studies in machine translation. The sentence alignment approach requires an alignment reliability threshold; candidate pairs scoring below it are omitted. The threshold value represents a trade-off between the quantity and quality of aligned pairs. We estimate that the default threshold value produces a good-quality dataset for most users. We release three datasets (files) that reflect this trade-off. The Normal dataset uses the default reliability threshold and contains 496,102 sentence pairs, the Strict dataset contains 474,852 sentence pairs, and the Very Strict dataset contains 425,534 sentence pairs. A file with thesis metadata is also included. The first column in each of the three TSV files gives the confidence that the alignment is correct (higher is better), the second and third are the source and target (Slovene and English) sentences, and the fourth gives the “merged” state, i.e. whether sentences in the source or target language were merged (sentences do not always exhibit a one-to-one mapping). The last column gives the thesis ID. Reference: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
dc.language.iso slv
dc.language.iso eng
dc.publisher Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://doi.org/10.5281/zenodo.5562228
dc.rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
dc.rights.uri https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
dc.rights.label ACA
dc.source.uri https://nl.ijs.si/kas/
dc.subject academic writing
dc.subject PhD theses
dc.subject MSc/MA theses
dc.subject BSc/BA theses
dc.subject machine translation
dc.title Machine Translation datasets from the KAS corpus KAS-MT 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Aleš Žagar Ales.Zagar@fri.uni-lj.si Faculty of Computer and Information Science
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
sponsor Ministry of Culture C3340-20-278001 Development of Slovene in a Digital Environment Other
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
sponsor ARRS (Slovenian Research Agency) J6-2581 Računalniško podprta večjezična analiza novičarskega diskurza s kontekstualnimi besednimi vložitvami (Computer-assisted multilingual analysis of news discourse with contextual word embeddings) nationalFunds
size.info 21902039 words
files.count 1
files.size 190986124
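
The column layout given in dc.description can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: that each TSV file has exactly five tab-separated columns in the order described (confidence, Slovene sentence, English sentence, merged state, thesis ID), that the confidence is a plain decimal number, and that fields are not quoted. The names `AlignedPair` and `read_pairs` are hypothetical, and the sample rows in the usage note are made up for illustration; they are not taken from the dataset.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class AlignedPair(NamedTuple):
    """One row of a KAS-MT TSV file (column order assumed from the description)."""
    confidence: float  # alignment confidence, higher is better
    slovene: str       # assumed: second column, source sentence
    english: str       # assumed: third column, target sentence
    merged: str        # "merged" state, kept as raw text (value format not documented here)
    thesis_id: str     # last column


def read_pairs(lines: Iterable[str], min_confidence: float = 0.0) -> Iterator[AlignedPair]:
    """Yield rows whose confidence is at least min_confidence.

    Rows that do not have five columns, or whose first column is not a
    number (for example a header row, if one exists), are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue
        try:
            confidence = float(row[0])
        except ValueError:
            continue
        if confidence >= min_confidence:
            yield AlignedPair(confidence, row[1], row[2], row[3], row[4])


# Made-up sample rows, only to show the call.
sample = [
    "0.9\tDober dan.\tGood day.\tno\tkas-1\n",
    "0.2\tHvala.\tThanks.\tno\tkas-2\n",
]
kept = list(read_pairs(sample, min_confidence=0.5))
print(len(kept), kept[0].english)
```

For a real file, the same function can be passed an open file object, e.g. `open(path, encoding="utf-8", newline="")`, where `path` points to one of the extracted TSV files. Raising `min_confidence` trades quantity for quality in the same spirit as the Strict and Very Strict files, although the thresholds used to build those files are not stated in this record.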


 Files in this item

This item is Academic Use and licensed under: CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0
Conditions: Inform Before Use, Attribution Required, Noncommercial
Name: kas.mt.tar.gz
Size: 182.14 MB
Format: application/gzip
Description: Machine translation datasets
MD5: 52a93771490aecb58832d38ddcff2e2e
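
The MD5 value listed for kas.mt.tar.gz can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib`; the helper name `md5_of_file` and the chunk size are illustrative choices, and the local path in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed in this record for kas.mt.tar.gz.
EXPECTED_MD5 = "52a93771490aecb58832d38ddcff2e2e"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that the archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved to the current directory, `md5_of_file("kas.mt.tar.gz") == EXPECTED_MD5` should evaluate to `True` for an intact download.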
