Show simple item record

 
dc.contributor.author Bañón, Marta
dc.contributor.author Chichirau, Malina
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Forcada, Mikel L.
dc.contributor.author Galiano-Jiménez, Aarón
dc.contributor.author García-Romero, Cristian
dc.contributor.author Kuzman, Taja
dc.contributor.author Ljubešić, Nikola
dc.contributor.author van Noord, Rik
dc.contributor.author Pla Sempere, Leopoldo
dc.contributor.author Ramírez-Sánchez, Gema
dc.contributor.author Rupnik, Peter
dc.contributor.author Suchomel, Vít
dc.contributor.author Toral, Antonio
dc.contributor.author Zaragoza-Bernabeu, Jaume
dc.date.accessioned 2023-04-25T13:50:46Z
dc.date.available 2023-04-25T13:50:46Z
dc.date.issued 2023-04-26
dc.identifier.uri http://hdl.handle.net/11356/1819
dc.description The Serbian-English parallel corpus MaCoCu-sr-en 1.0 was built by crawling the “.rs” and “.срб” internet top-level domains in 2021 and 2022, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicated paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and Bicleaner AI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus. The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. In each format, the texts are separated based on the script into two files: a Latin and a Cyrillic subcorpus. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) and additional metadata.
The following metadata is included in both sentence-level formats:
- source and target document URL;
- paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, which means “paragraph 35 out of 77, sentence 1 out of 3”);
- quality score as provided by the tool Bicleaner AI (the likelihood that a pair of sentences are mutual translations, expressed as a score between 0 and 1);
- similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
- personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
- translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified by using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines whether the translation has been produced by a machine-translation system;
- a DSI class (“dsi”): information on whether the segment is connected to any of the Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal), defined by the Connecting Europe Facility (https://github.com/RikVN/DSI);
- English language variant: the variant of English (British or American) was identified at document and domain level using a lexicon-based English variety classifier (https://pypi.org/project/abclf/).
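The paragraph ID described above can be split into its four numeric parts. The following is a minimal sketch, assuming that every ID follows exactly the shape of the single example given in this record (“p35:77s1/3”); the names `SegmentPosition` and `parse_paragraph_id` are hypothetical helpers, not part of any MaCoCu tool, and IDs in the actual corpus may take other forms that this pattern rejects.

```python
import re
from typing import NamedTuple


class SegmentPosition(NamedTuple):
    """Position of a sentence, as encoded in a paragraph ID (hypothetical helper)."""
    paragraph: int               # index of the paragraph in the document
    paragraphs_in_document: int  # total number of paragraphs in the document
    sentence: int                # index of the sentence in the paragraph
    sentences_in_paragraph: int  # total number of sentences in the paragraph


# Assumption: the shape is inferred from the one documented example, "p35:77s1/3".
_ID_PATTERN = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")


def parse_paragraph_id(paragraph_id: str) -> SegmentPosition:
    """Parse an ID such as 'p35:77s1/3'; raise ValueError on any other shape."""
    match = _ID_PATTERN.fullmatch(paragraph_id.strip())
    if match is None:
        raise ValueError(f"unexpected paragraph ID format: {paragraph_id!r}")
    return SegmentPosition(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_paragraph_id("p35:77s1/3"))
```

Raising an error on unmatched IDs, rather than returning a default, makes it easy to spot any ID variants that the documented example does not cover.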
Furthermore, the sentence-level TXT format provides additional metadata:
- web domain of the text;
- source and target document title;
- the date when the original file was retrieved;
- the original type of the file (e.g., “html”) from which the sentence was extracted;
- paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext);
- information on whether the sentence is a heading in the original document.

The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American).

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
dc.language.iso srp
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.publisher Prompsit
dc.publisher Rijksuniversiteit Groningen
dc.publisher Universitat d'Alacant
dc.relation.isreferencedby https://hdl.handle.net/11370/685514a8-947e-44f9-83cf-90356c5f1684
dc.rights CC0-No Rights Reserved
dc.rights.uri https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label PUB
dc.source.uri https://macocu.eu/
dc.subject web corpus
dc.subject parallel corpus
dc.subject multilingual
dc.title Serbian-English parallel corpus MaCoCu-sr-en 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Miquel Esplà-Gomis mespla@dlsi.ua.es Universitat d’Alacant
sponsor Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 2068916 entries
size.info 95863993 words
size.info 312335 texts
files.count 6
files.size 2393326773


Files in this item

This item is Publicly Available and licensed under: CC0-No Rights Reserved
Name MaCoCu-sr-en.cyrillic.doc.txt.gz
Size 310.83 MB
Format application/gzip
Description Corpus in document-level TXT format, Cyrillic
MD5 25ccdf72780198f215a6823988a56503

Name MaCoCu-sr-en.cyrillic.sent.txt.gz
Size 173.36 MB
Format application/gzip
Description Corpus in sentence-level TXT format, Cyrillic
MD5 25b9c639b553dd923bbbae3d6e8b430d

Name MaCoCu-sr-en.cyrillic.tmx.gz
Size 164.37 MB
Format application/gzip
Description Corpus in sentence-level TMX format, Cyrillic
MD5 2da3bc6601b4a36123d22909f0715c70

Name MaCoCu-sr-en.latin.doc.txt.gz
Size 669.96 MB
Format application/gzip
Description Corpus in document-level TXT format, Latin
MD5 a596cbac6857033dc5ab28786bbd58ca

Name MaCoCu-sr-en.latin.sent.txt.gz
Size 499.71 MB
Format application/gzip
Description Corpus in sentence-level TXT format, Latin
MD5 adf90151f028e34e3669dc0703a11aa1

Name MaCoCu-sr-en.latin.tmx.gz
Size 464.24 MB
Format application/gzip
Description Corpus in sentence-level TMX format, Latin
MD5 13bddfdc9ceea58a374947a08ef0fd73
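
The MD5 values listed for the files above can be used to check that a download arrived intact. The following is a minimal sketch, assuming the files were saved under their original names; `md5_of_file` and `verify_download` are hypothetical helper names, and the checksum table is copied from this record.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file list in this record.
EXPECTED_MD5 = {
    "MaCoCu-sr-en.cyrillic.doc.txt.gz": "25ccdf72780198f215a6823988a56503",
    "MaCoCu-sr-en.cyrillic.sent.txt.gz": "25b9c639b553dd923bbbae3d6e8b430d",
    "MaCoCu-sr-en.cyrillic.tmx.gz": "2da3bc6601b4a36123d22909f0715c70",
    "MaCoCu-sr-en.latin.doc.txt.gz": "a596cbac6857033dc5ab28786bbd58ca",
    "MaCoCu-sr-en.latin.sent.txt.gz": "adf90151f028e34e3669dc0703a11aa1",
    "MaCoCu-sr-en.latin.tmx.gz": "13bddfdc9ceea58a374947a08ef0fd73",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no checksum listed for {path.name}")
    return md5_of_file(path) == expected
```

For example, `verify_download("MaCoCu-sr-en.latin.tmx.gz")` compares the local file against the checksum listed for that name. Reading in chunks keeps memory use constant, which matters for archives of several hundred megabytes.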
