{"id":6663,"date":"2023-11-15T07:56:43","date_gmt":"2023-11-15T07:56:43","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=6663"},"modified":"2026-03-10T09:09:55","modified_gmt":"2026-03-10T09:09:55","slug":"classla-web-crawler","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/classla-web-crawler\/","title":{"rendered":"CLASSLA-web Crawler"},"content":{"rendered":"<p>The CLASSLA-web crawler is a continuation of the <a href=\"https:\/\/macocu.eu\/\" target=\"_blank\" rel=\"noopener\">MaCoCu<\/a> project, a <a href=\"https:\/\/ec.europa.eu\/inea\/en\/connecting-europe-facility\">CEF<\/a>-funded project, which aimed to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish.<\/p>\n<p>After the MaCoCu project finished, the project members established the CLASSLA-web crawling infrastructure. The aim of the infrastructure is to continue collecting web corpora for South Slavic and other languages under the auspices of the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). The collection of monolingual data is performed by the <a href=\"https:\/\/www.ijs.si\/ijsw\/JSI\">Jo\u017eef Stefan Institute<\/a>, Ljubljana, Slovenia.<\/p>\n<p>&nbsp;<\/p>\n<h2>Web crawling<\/h2>\n<p>We run a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Web_crawler\">web crawler<\/a> to download texts from the Web. The software we use is <a href=\"http:\/\/corpus.tools\/wiki\/SpiderLing\">SpiderLing<\/a>, developed by the\u00a0<a href=\"https:\/\/nlp.fi.muni.cz\/en\/NLPCentre\">Natural Language Processing Centre<\/a> at Masaryk University, Czech Republic.<\/p>\n<p>&nbsp;<\/p>\n<h2>What do we do with the downloaded data?<\/h2>\n<p>We are interested in language use rather than the content of the downloaded texts. 
The retrieved text will be cleaned, de-duplicated and annotated with text type information. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Text_corpus\">Text corpora<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Computational_linguistics\">computational linguistics research<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\">language models<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> tasks will be built from the data.<\/p>\n<h2>What if I don&#8217;t want my website to be crawled?<\/h2>\n<p>Our crawler adheres to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robots_exclusion_standard\">Robots exclusion standard<\/a>. You can restrict access to some or all of the pages on your website by creating a robots.txt file. The user-agent identifier of our crawler is <tt>CLASSLA-web<\/tt>. To prevent our crawler from crawling your website, include the following in your robots.txt:<\/p>\n<pre>User-agent: CLASSLA-web\r\nDisallow: \/<\/pre>\n<p>Please note that the crawler reads your robots.txt the first time it accesses your site, so any changes take effect the next time the crawler is run, not immediately.<\/p>\n<h2>Contacts<\/h2>\n<ul>\n<li>Nikola Ljube\u0161i\u0107, nljubesi at gmail dot com<\/li>\n<li>Taja Kuzman Punger\u0161ek, taja.kuzman at ijs dot si<\/li>\n<li>V\u00edt Suchomel, vit.suchomel at sketchengine dot eu<\/li>\n<\/ul>\n<div id=\"themify_builder_content-6663\" data-postid=\"6663\" class=\"themify_builder_content themify_builder_content-6663 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>The CLASSLA-web crawler is a continuation of the MaCoCu project, a CEF-funded project, which aimed to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, 
Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. After the MaCoCu project [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-6663","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6663","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=6663"}],"version-history":[{"count":8,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6663\/revisions"}],"predecessor-version":[{"id":8854,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6663\/revisions\/8854"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=6663"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}