Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume
2022-04-28T07:33:31Z 2022-04-28T07:33:31Z 2022-04-28
dc.description The Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 was built by crawling the ".bg" and ".бг" internet top-level domains in 2021, extending the crawl dynamically to other domains as well.

The entire crawling process was carried out by the MaCoCu crawler. Websites containing documents in both target languages were identified and processed using the tool Bitextor. Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment were carried out as implemented in Bitextor, and BicleanerAI and Bifixer were used for fixing, cleaning, and deduplicating the final version of the corpus.

While the TXT format consists solely of pairs of source and target segments (one or several sentences), each segment pair in the TMX format is accompanied by the following metadata:
- source and target document URL;
- quality score as provided by the tool BicleanerAI;
- translation direction identification: the source segment in each segment pair was identified by using a probabilistic model;
- personal information identification (“biroamer-entities”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
- language variants: the language variant of English (British or American) was identified for every segment pair at the document and domain level.

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
(1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
(2) Clearly identify the copyrighted work claimed to be infringed.
(3) Clearly identify the material that is claimed to be infringing and provide information reasonably sufficient to allow us to locate the material.
(4) Write to the contact person for this resource, whose email is available in the full item record.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso bul
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.publisher Prompsit
dc.publisher Rijksuniversiteit Groningen
dc.publisher Universitat d'Alacant
dc.rights CC0-No Rights Reserved
dc.rights.label PUB
dc.subject web corpus
dc.subject parallel corpus
dc.subject multilingual
dc.title Bulgarian-English parallel corpus MaCoCu-bg-en 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Miquel Esplà-Gomis Universitat d’Alacant
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
158726260 words 3857653 entries
files.count 2
files.size 1638042069

 Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved
Corpus in TMX format (1.04 GB)
Corpus in plain text format (498.27 MB)
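
The description says that each segment pair in the TMX file carries metadata such as a BicleanerAI quality score and a personal-information flag. The following is a minimal Python sketch of how a user might stream the TMX file and keep only unflagged, high-scoring pairs. It relies only on the standard TMX layout (`<tu>` units containing `<prop>`, `<tuv xml:lang="...">` and `<seg>` elements). The property name `score-bicleaner-ai`, the `Yes`/`No` values of `biroamer-entities`, the 0.5 threshold, and the helper names are assumptions made for illustration and should be checked against the downloaded file. Only properties attached directly to `<tu>` are read here.

```python
import io
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_segment_pairs(tmx_source):
    """Yield (props, segments) for each <tu> in a TMX file or file-like object."""
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        # Properties attached directly to the translation unit.
        props = {p.get("type"): (p.text or "") for p in elem.findall("prop")}
        segments = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            segments[tuv.get(XML_LANG)] = text
        yield props, segments
        elem.clear()  # release the unit's children; the real file is large


def keep(props, min_score=0.5):
    """Hypothetical filter: the prop names and values are assumptions."""
    flagged = props.get("biroamer-entities", "No").strip().lower() == "yes"
    score = float(props.get("score-bicleaner-ai", "0"))
    return not flagged and score >= min_score


# Tiny hand-made sample (not taken from the corpus) so the sketch runs as is.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <prop type="score-bicleaner-ai">0.92</prop>
      <prop type="biroamer-entities">No</prop>
      <tuv xml:lang="bg"><seg>Добър ден.</seg></tuv>
      <tuv xml:lang="en"><seg>Good afternoon.</seg></tuv>
    </tu>
    <tu>
      <prop type="score-bicleaner-ai">0.88</prop>
      <prop type="biroamer-entities">Yes</prop>
      <tuv xml:lang="bg"><seg>Обадете се на Иван.</seg></tuv>
      <tuv xml:lang="en"><seg>Call Ivan.</seg></tuv>
    </tu>
    <tu>
      <prop type="score-bicleaner-ai">0.31</prop>
      <prop type="biroamer-entities">No</prop>
      <tuv xml:lang="bg"><seg>Начало</seg></tuv>
      <tuv xml:lang="en"><seg>Contact us today</seg></tuv>
    </tu>
  </body>
</tmx>
"""

kept = [
    segments
    for props, segments in iter_segment_pairs(io.BytesIO(SAMPLE.encode("utf-8")))
    if keep(props)
]
```

To run this on the downloaded corpus, pass the path of the TMX file to `iter_segment_pairs` instead of the in-memory sample. Streaming with `iterparse` avoids loading the whole file, which is listed above at about 1 GB, into memory at once.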
