dc.contributor.author | Donaj, Gregor |
dc.contributor.author | Antloga, Špela |
dc.date.accessioned | 2022-11-19T09:28:44Z |
dc.date.available | 2022-11-19T09:28:44Z |
dc.date.issued | 2022-11-15 |
dc.identifier.uri | http://hdl.handle.net/11356/1714 |
dc.description | ParaDiom is a parallel corpus with sentences sampled from existing corpora. The corpus contains 1,000 Slovene sentences with their English translations and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, which are annotated in the corpus. Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching through sentences in the corpora ccGigafida (http://hdl.handle.net/11356/1035), ParlaMint (http://hdl.handle.net/11356/1431), and The Corpus of Late Modern English Texts (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html). All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features, and lemmas, using Stanza (https://github.com/stanfordnlp/stanza) for the English sentences and CLASSLA (https://github.com/clarinsi/classla) for the Slovene sentences. Some idioms were found as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand, and the other half were translated using machine translation and post-editing. We used the Q-CAT annotation tool (http://hdl.handle.net/11356/1262) to annotate the idiomatic expressions. The annotated noun, adjective, and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms MWE VID (‘verbal idiomatic multiword expression’), similes MWE SIM (‘simile’), and proverbs MWE P (‘proverb’). |
dc.language.iso | slv |
dc.language.iso | eng |
dc.publisher | Faculty of Electrical Engineering and Computer Science, University of Maribor |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | parallel corpus |
dc.subject | TEI |
dc.subject | idiomatic expressions |
dc.title | Parallel corpus of idiomatic text ParaDiom 1.0 |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Gregor Donaj gregor.donaj@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 2933 idiomaticExpressions |
size.info | 66413 words |
size.info | 2000 translationUnits |
files.count | 1 |
files.size | 1173167 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: ParaDiom.TEI.zip
- Size: 1.12 MB
- Format: application/zip
- Description: Corpus in TEI format
- MD5: f9e07bb9d0e8ae6eae3bb23456fc448d
- ParaDiom.TEI
  - ParaDiom-sl-2.xml (1 MB)
  - ParaDiom-sl-1.xml (1 MB)
  - schema
    - tei_clarin_schema.xml (70 kB)
    - tei_clarin_example.xml (48 kB)
    - tei_clarin.rnc (311 kB)
    - README.md (525 B)
    - tei_clarin.rng (654 kB)
  - ParaDiom-en-4.xml (1 MB)
  - ParaDiom-en-3.xml (1 MB)
  - ParaDiom-en-2.xml (1 MB)
  - mapping.tsv (91 kB)
  - ParaDiom-en-1.xml (1 MB)
  - 00README.txt (1 kB)
  - ParaDiom-sl-4.xml (1 MB)
  - ParaDiom-sl-3.xml (1 MB)
  - ParaDiom.xml (14 kB)
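
The MD5 value listed for ParaDiom.TEI.zip can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library; the local path `ParaDiom.TEI.zip` and the helper names `md5_of_file` and `matches_expected` are assumptions made for illustration and are not part of the record.

```python
import hashlib
from pathlib import Path

# MD5 listed for ParaDiom.TEI.zip in this record.
EXPECTED_MD5 = "f9e07bb9d0e8ae6eae3bb23456fc448d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Compare a file's MD5 digest with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("ParaDiom.TEI.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if matches_expected(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch most likely indicates an incomplete or corrupted download, in which case fetching the archive again is the usual remedy.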