Parallel corpus of idiomatic text ParaDiom 1.0

Name: Parallel corpus of idiomatic text ParaDiom 1.0
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Donaj, Gregor; Antloga, Špela


dc.contributor.author	Donaj, Gregor
dc.contributor.author	Antloga, Špela
dc.date.accessioned	2022-11-19T09:28:44Z
dc.date.available	2022-11-19T09:28:44Z
dc.date.issued	2022-11-15
dc.identifier.uri	http://hdl.handle.net/11356/1714
dc.description	ParaDiom is a parallel corpus of sentences sampled from existing corpora. It contains 1,000 Slovene sentences with their English translations and 1,000 English sentences with their Slovene translations. The sampled sentences contain idioms, similes, and proverbs, all of which are annotated in the corpus. Sentences were sampled based on a selection of 100 Slovene and 92 English idioms and similes by searching the corpora ccGigafida (http://hdl.handle.net/11356/1035), ParlaMint (http://hdl.handle.net/11356/1431), and The Corpus of Late Modern English Texts (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html). All sampled sentences were tagged with MULTEXT-East MSD tags, Universal Dependencies morphological features, and lemmas, using Stanza (https://github.com/stanfordnlp/stanza) for English sentences and CLASSLA (https://github.com/clarinsi/classla) for Slovene sentences. Some idioms occurred as part of proverbs, which were also annotated. Half of the sampled sentences were translated by hand; the other half were machine-translated and then post-edited. The idiomatic expressions were annotated with the Q-CAT annotation tool (http://hdl.handle.net/11356/1262). Annotated noun, adjective, and adverbial idioms were given the label MWE ID (‘idiomatic multiword expression’), verb idioms the label MWE VID (‘verbal idiomatic multiword expression’), similes the label MWE SIM (‘simile’), and proverbs the label MWE P (‘proverb’).
dc.language.iso	slv
dc.language.iso	eng
dc.publisher	Faculty of Electrical Engineering and Computer Science, University of Maribor
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.subject	parallel corpus
dc.subject	TEI
dc.subject	idiomatic expressions
dc.title	Parallel corpus of idiomatic text ParaDiom 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Gregor Donaj gregor.donaj@um.si Faculty of Electrical Engineering and Computer Science, University of Maribor
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	2933 idiomaticExpressions
size.info	66413 words
size.info	2000 translationUnits
files.count	1
files.size	1173167
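
The labelling scheme given in dc.description can be written down as a small lookup. The following is a minimal Python sketch, not part of the corpus distribution: the label codes and glosses come from the description above, while the names MWE_LABELS and label_for, and the category strings such as "noun" or "verb", are hypothetical and chosen only for illustration. How the labels are actually encoded inside the TEI files is not stated in this record.

```python
# Label codes and their glosses, as listed in the record description.
MWE_LABELS = {
    "MWE ID": "idiomatic multiword expression",
    "MWE VID": "verbal idiomatic multiword expression",
    "MWE SIM": "simile",
    "MWE P": "proverb",
}


def label_for(kind: str) -> str:
    """Return the label code for an expression category.

    The category names accepted here are assumptions made for this
    sketch; only the label codes themselves come from the record.
    """
    kind = kind.strip().lower()
    if kind in {"noun", "adjective", "adverbial"}:
        return "MWE ID"
    if kind == "verb":
        return "MWE VID"
    if kind == "simile":
        return "MWE SIM"
    if kind == "proverb":
        return "MWE P"
    raise ValueError(f"unknown expression category: {kind!r}")
```

For example, label_for("verb") gives "MWE VID", and MWE_LABELS["MWE VID"] gives the corresponding gloss.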

Files in this item

This item is Publicly Available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: ParaDiom.TEI.zip
Size: 1.12 MB
Format: application/zip
Description: Corpus in TEI format
MD5: f9e07bb9d0e8ae6eae3bb23456fc448d
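
The MD5 value above can be used to check that a download is intact. The following is a minimal Python sketch using only the standard library; the function names md5_of and verify are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum as published in this record for ParaDiom.TEI.zip.
EXPECTED_MD5 = "f9e07bb9d0e8ae6eae3bb23456fc448d"


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of(path) == expected.lower()
```

Calling verify("ParaDiom.TEI.zip") on a complete download should return True; a False result suggests a truncated or altered file. MD5 is adequate for detecting accidental corruption but is not a safeguard against deliberate tampering.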


File preview

ParaDiom.TEI
- ParaDiom-sl-2.xml (1 MB)
- ParaDiom-sl-1.xml (1 MB)
- schema
  - tei_clarin_schema.xml (70 kB)
  - tei_clarin_example.xml (48 kB)
  - tei_clarin.rnc (311 kB)
  - README.md (525 B)
  - tei_clarin.rng (654 kB)
- ParaDiom-en-4.xml (1 MB)
- ParaDiom-en-3.xml (1 MB)
- ParaDiom-en-2.xml (1 MB)
- mapping.tsv (91 kB)
- ParaDiom-en-1.xml (1 MB)
- 00README.txt (1 kB)
- ParaDiom-sl-4.xml (1 MB)
- ParaDiom-sl-3.xml (1 MB)
- ParaDiom.xml (14 kB)
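
The archive members in the listing follow a ParaDiom-<language>-<number>.xml naming pattern. The following is a minimal Python sketch that groups such names by language code and sorts them by part number. The function name group_parts and the regular expression are hypothetical, and what each numbered part contains is not stated in this record, so the sketch relies on the file names only.

```python
import re
from collections import defaultdict

# Matches the numbered component files seen in the listing,
# e.g. "ParaDiom-sl-2.xml" -> language "sl", part 2.
PART_RE = re.compile(r"^ParaDiom-(sl|en)-(\d+)\.xml$")


def group_parts(names):
    """Group component file names by language code, ordered by part number.

    Names that do not match the pattern (schema files, mapping.tsv,
    README files, ParaDiom.xml) are ignored.
    """
    groups = defaultdict(list)
    for name in names:
        base = name.rsplit("/", 1)[-1]  # drop any directory prefix
        match = PART_RE.match(base)
        if match:
            groups[match.group(1)].append((int(match.group(2)), base))
    return {lang: [base for _, base in sorted(parts)]
            for lang, parts in groups.items()}
```

With the standard library, the input can be obtained from the downloaded archive via zipfile.ZipFile("ParaDiom.TEI.zip").namelist().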
