Turkish web corpus MaCoCu-tr 2.0

Name: Turkish web corpus MaCoCu-tr 2.0
License: https://creativecommons.org/publicdomain/zero/1.0/
Keywords: web corpus

Bañón, Marta; Chichirau, Malina; Esplà-Gomis, Miquel; Forcada, Mikel L.; Galiano-Jiménez, Aarón; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; Zaragoza-Bernabeu, Jaume

dc.contributor.author	Bañón, Marta
dc.contributor.author	Chichirau, Malina
dc.contributor.author	Esplà-Gomis, Miquel
dc.contributor.author	Forcada, Mikel L.
dc.contributor.author	Galiano-Jiménez, Aarón
dc.contributor.author	García-Romero, Cristian
dc.contributor.author	Kuzman, Taja
dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	van Noord, Rik
dc.contributor.author	Pla Sempere, Leopoldo
dc.contributor.author	Ramírez-Sánchez, Gema
dc.contributor.author	Rupnik, Peter
dc.contributor.author	Suchomel, Vít
dc.contributor.author	Toral, Antonio
dc.contributor.author	Zaragoza-Bernabeu, Jaume
dc.date.accessioned	2023-04-20T06:30:24Z
dc.date.available	2023-04-20T06:30:24Z
dc.date.issued	2023-04-20
dc.identifier.uri	http://hdl.handle.net/11356/1802
dc.description	The Turkish web corpus MaCoCu-tr 2.0 was built by crawling the ".tr" and ".cy" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata, which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each accompanied by metadata on whether the paragraph is a heading, on paragraph quality (labels such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext) and fluency (a score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), on the automatically identified language of the text in the paragraph, and on whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer).
As opposed to the previous version, this version has more accurate metadata on the languages of the texts, which was achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools used for web corpus creation and curation have been updated as well, resulting in an even cleaner corpus. The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/).
Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.
A newer version of the corpus is available as part of the MaCoCu-Genre corpora collection at http://hdl.handle.net/11356/1969. The main novelty of the MaCoCu-Genre version is that the texts have been automatically annotated with genre categories. Additionally, the corpus underwent additional post-processing and has been transformed to the JSONL format.
dc.language.iso	tur
dc.publisher	Jožef Stefan Institute
dc.publisher	Prompsit
dc.publisher	Rijksuniversiteit Groningen
dc.publisher	Universitat d'Alacant
dc.relation.isreferencedby	https://hdl.handle.net/11370/685514a8-947e-44f9-83cf-90356c5f1684
dc.relation.replaces	http://hdl.handle.net/11356/1514
dc.rights	CC0-No Rights Reserved
dc.rights.uri	https://creativecommons.org/publicdomain/zero/1.0/
dc.rights.label	PUB
dc.source.uri	https://macocu.eu/
dc.subject	web corpus
dc.title	Turkish web corpus MaCoCu-tr 2.0
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Miquel Esplà-Gomis mespla@dlsi.ua.es Universitat d’Alacant
sponsor	Connecting Europe Facility (CEF) Telecom INEA/CEF/ICT/A2020/2278341 MaCoCu - Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages Other
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	15961125 texts
size.info	4344850253 words
files.count	2
files.size	16185448347
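
The paragraph-level metadata listed in dc.description (quality label, fluency score, identified language, sensitive-information flag) can be used to filter the corpus. The following is a minimal sketch, not the project's own tooling. It uses only the Python standard library on a small invented sample. The element names (doc, p), the attribute names (quality, lm_score, lang, sensitive, heading), the wrapping corpus root element, and the 0.5 fluency threshold are all assumptions made for illustration and should be checked against the distributed files. For the real corpus files, the prevert parser named in the description is the documented route; its API is not shown here.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element and attribute names are ASSUMPTIONS for
# illustration; verify them against the actual corpus files.
SAMPLE = """<corpus>
<doc id="doc-1" title="Örnek" url="http://example.tr/" domain="example.tr" lm_score="0.93">
<p id="doc-1.1" quality="good" lm_score="0.95" lang="tr" heading="yes" sensitive="no">Örnek başlık</p>
<p id="doc-1.2" quality="short" lm_score="0.40" lang="tr" heading="no" sensitive="no">Kısa.</p>
<p id="doc-1.3" quality="good" lm_score="0.88" lang="en" heading="no" sensitive="no">An English paragraph.</p>
<p id="doc-1.4" quality="good" lm_score="0.91" lang="tr" heading="no" sensitive="yes">Hassas bilgi içeren paragraf.</p>
<p id="doc-1.5" quality="good" lm_score="0.97" lang="tr" heading="no" sensitive="no">Bu paragraf filtreyi geçer.</p>
</doc>
</corpus>"""


def keep_paragraph(p, min_fluency=0.5, lang="tr"):
    """Return True if a <p> element passes all four metadata checks."""
    return (
        p.get("quality") == "good"                       # jusText-style quality label
        and float(p.get("lm_score", "0")) >= min_fluency  # fluency score in [0, 1]
        and p.get("lang") == lang                         # identified paragraph language
        and p.get("sensitive") != "yes"                   # drop flagged paragraphs
    )


def filter_paragraphs(xml_text, min_fluency=0.5, lang="tr"):
    """Return (paragraph id, text) pairs for paragraphs that pass the filter."""
    root = ET.fromstring(xml_text)
    kept = []
    for doc in root.iter("doc"):
        for p in doc.iter("p"):
            if keep_paragraph(p, min_fluency=min_fluency, lang=lang):
                kept.append((p.get("id"), (p.text or "").strip()))
    return kept


if __name__ == "__main__":
    for pid, text in filter_paragraphs(SAMPLE):
        print(pid, text)
```

Raising min_fluency tightens the filter; the right threshold depends on the downstream use and is not specified in this record.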