MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data

Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije (Common Language Resources and Technology Infrastructure, Slovenia)

MaCoCu

The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. The collection of monolingual data is performed by the Jožef Stefan Institute, Ljubljana, Slovenia.

Web crawling

We run a web crawler to download texts from the Web. The software we use is SpiderLing, developed by the Natural Language Processing Centre at Masaryk University, Czech Republic.

What do we do with the downloaded data?

We are interested in language use rather than in the content of the downloaded texts. The retrieved text will be cleaned, de-duplicated and annotated with text type information. The data will be used to build text corpora for computational linguistics research and language models for natural language processing tasks.

What if I don’t want my website to be crawled?

Our crawler adheres to the Robots Exclusion Standard. You can restrict access to some or all of the pages on your website by creating a robots.txt file. The user-agent identification of our crawler is MaCoCu. This is what to include in your robots.txt if you want to prevent our crawler from crawling your website:

User-agent: MaCoCu
Disallow: /

Please note that the crawler reads your robots.txt the first time it accesses your site, so any changes take effect the next time the crawler is run, not immediately.
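
If you would like to check a robots.txt rule before publishing it, the following is a minimal sketch using Python's standard urllib.robotparser module. It tests the rule shown above, plus a partial restriction, against the MaCoCu user agent. The example.org URLs, the /private/ path and the is_allowed helper are hypothetical and only serve the illustration. Python's parser is not the parser used by the crawler, so treat the result as a sanity check rather than a guarantee of crawler behaviour.

```python
from urllib.robotparser import RobotFileParser

# Rule from this page: block the MaCoCu crawler from the whole site.
BLOCK_ALL = """\
User-agent: MaCoCu
Disallow: /
"""

# Hypothetical partial restriction: block only one directory.
BLOCK_PRIVATE = """\
User-agent: MaCoCu
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.org/some/page.html"  # hypothetical URL
    print("MaCoCu, block all:", is_allowed(BLOCK_ALL, "MaCoCu", page))
    print("Other bot, block all:", is_allowed(BLOCK_ALL, "SomeOtherBot", page))
    print(
        "MaCoCu, private page:",
        is_allowed(BLOCK_PRIVATE, "MaCoCu", "https://example.org/private/a.html"),
    )
    print(
        "MaCoCu, public page:",
        is_allowed(BLOCK_PRIVATE, "MaCoCu", "https://example.org/public.html"),
    )
```

A rule that names only MaCoCu leaves other crawlers unaffected, which is why the second check is expected to allow access.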

Contacts

Vít Suchomel, vit.suchomel at sketchengine dot eu
Nikola Ljubešić, nljubesi at gmail dot com