Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

CLASSLA-web Crawler

The CLASSLA-web crawler continues the work of MaCoCu, a CEF-funded project that aimed to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish.

After the MaCoCu project ended, its members established the CLASSLA-web crawling infrastructure. Its aim is to continue collecting web corpora for South Slavic and other languages under the auspices of the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). The monolingual data is collected by the Jožef Stefan Institute, Ljubljana, Slovenia.

Web crawling

We run a web crawler to download texts from the Web. The software we use is SpiderLing, developed by the Natural Language Processing Centre at Masaryk University, Czech Republic.

What do we do with the downloaded data?

We are interested in language use rather than in the content of the downloaded texts. The retrieved text will be cleaned, de-duplicated and annotated with text type information. The data will be used to build text corpora for computational linguistics research and language models for natural language processing tasks.

What if I don’t want my website to be crawled?

Our crawler adheres to the Robots Exclusion Standard. You can restrict access to some or all of the pages on your website by creating a robots.txt file. Our crawler identifies itself with the user agent CLASSLA-web. To prevent it from crawling your website, include the following in your robots.txt:

User-agent: CLASSLA-web
Disallow: /

Please note that the crawler reads your robots.txt the first time it accesses your site, so any changes take effect the next time the crawler is run, not immediately.
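
If you would like to check your robots.txt before the next crawl, the following is a minimal sketch using Python's standard urllib.robotparser module. It shows how that parser interprets the rule given above; it is not the crawler's own code, and SpiderLing's parsing may differ in details. The helper name is_allowed and the example.org URL are hypothetical and used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, as it would appear in a robots.txt file.
ROBOTS_TXT = """\
User-agent: CLASSLA-web
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the standard-library parser lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL, for illustration only.
page = "https://example.org/page.html"

classla_allowed = is_allowed(ROBOTS_TXT, "CLASSLA-web", page)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)

print("CLASSLA-web allowed:", classla_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

With this rule in place the check should report that CLASSLA-web is not allowed, while user agents not named in the file remain unaffected. To restrict only part of a site, replace the "/" in the Disallow line with the path you want to exclude.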

Contacts

  • Nikola Ljubešić, nljubesi at gmail dot com
  • Taja Kuzman, taja.kuzman at ijs dot si
  • Vít Suchomel, vit.suchomel at sketchengine dot eu