Bulgarian-English parallel corpus MaCoCu-bg-en 1.0

Name: Bulgarian-English parallel corpus MaCoCu-bg-en 1.0
License: https://creativecommons.org/publicdomain/zero/1.0/

Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume

dc.contributor.author	Bañón, Marta
dc.contributor.author	Esplà-Gomis, Miquel
dc.contributor.author	Forcada, Mikel L.
dc.contributor.author	García-Romero, Cristian
dc.contributor.author	Kuzman, Taja
dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	van Noord, Rik
dc.contributor.author	Pla Sempere, Leopoldo
dc.contributor.author	Ramírez-Sánchez, Gema
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Suchomel, Vít
dc.contributor.author	Toral, Antonio
dc.contributor.author	van der Werff, Tobias
dc.contributor.author	Zaragoza, Jaume
dc.date.accessioned	2022-04-28T07:33:31Z
dc.date.available	2022-04-28T07:33:31Z
dc.date.issued	2022-04-28
dc.identifier.uri	http://hdl.handle.net/11356/1521
dc.description	The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and BicleanerAI (https://github.com/bitextor/bicleaner-ai) and Bifixer (https://github.com/bitextor/bifixer) were used for fixing, cleaning, and deduplicating the final version of the corpus.

While the TXT format consists solely of pairs of source and target segments (one or several sentences), each segment pair in the TMX format is accompanied by the following metadata:
- source and target document URL;
- quality score as provided by the tool BicleanerAI;
- translation direction identification: the source segment in each segment pair was identified by using a probabilistic model;
- personal information identification (“biroamer-entities”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
- language variants: the language variant of English (British or American) was identified for every segment pair at the document and domain level.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material.
(4) Please write to the contact person for this resource whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso	bul
dc.language.iso	eng
dc.publisher	Jožef Stefan Institute
dc.publisher	Prompsit
dc.publisher	Rijksuniversiteit Groningen
dc.publisher	Universitat d'Alacant
dc.relation.isreplacedby	http://hdl.handle.net/11356/1815
dc.rights	CC0-No Rights Reserved
dc.rights.uri	https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label	PUB
dc.source.uri	https://macocu.eu/
dc.subject	web corpus
dc.subject	parallel corpus
dc.subject	multilingual
dc.title	Bulgarian-English parallel corpus MaCoCu-bg-en 1.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Miquel Esplà-Gomis mespla@dlsi.ua.es Universitat d’Alacant
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	158726260 words
size.info	3857653 entries
files.count	2
files.size	1638042069

Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved

Name: MaCoCu-bg-en.tmx.gz
Size: 1.04 GB
Format: application/gzip
Description: Corpus in TMX format
MD5: a064398357bd7678e3f00d0d53ff3e33
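
The record describes each TMX segment pair as carrying metadata (document URLs, a BicleanerAI score, translation direction, "biroamer-entities" flags, language variant). The exact property names used inside the file are not listed here, so the following is only a minimal sketch that streams the gzipped TMX and collects whatever <prop> elements are present, without assuming their names. It assumes the standard TMX 1.4 layout (<tu> elements containing <prop> and <tuv xml:lang="..."> with a <seg> child, no XML namespace on element names); the function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _props(element):
    """Collect the direct <prop type="..."> children of an element."""
    return {p.get("type"): (p.text or "") for p in element.findall("prop")}


def iter_translation_units(stream):
    """Yield one dict per <tu>, read incrementally from a binary stream.

    Assumes standard TMX 1.4 element names; property names are taken
    from the file as-is rather than hard-coded.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            segments[tuv.get(XML_LANG)] = {"text": text, "props": _props(tuv)}
        yield {"props": _props(elem), "segments": segments}
        # Drop the unit's children so a multi-gigabyte file is not held
        # in memory at once (empty <tu> shells still accumulate).
        elem.clear()


def count_units(path):
    """Count segment pairs in a gzipped TMX file such as MaCoCu-bg-en.tmx.gz."""
    with gzip.open(path, "rb") as stream:
        return sum(1 for _unit in iter_translation_units(stream))
```

Printing the first few yielded units is a quick way to see which property names the file actually uses before filtering on, for example, the quality score or the personal-information flag.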


Name: MaCoCu-bg-en.txt.gz
Size: 498.27 MB
Format: application/gzip
Description: Corpus in plain text format
MD5: 835654c03e5cfc23767dfa94e5425305
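
Since both files are large, it can be worth checking a download against the MD5 values listed in this record. The following is a minimal sketch using only the Python standard library; the file names and checksums are copied from the entries above, while the function names and the assumption that the files sit in a local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "MaCoCu-bg-en.tmx.gz": "a064398357bd7678e3f00d0d53ff3e33",
    "MaCoCu-bg-en.txt.gz": "835654c03e5cfc23767dfa94e5425305",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling verify("downloads") returns a dictionary keyed by file name; a False value means the file is missing from that directory or its checksum differs from the one listed here.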
