{"id":5162,"date":"2021-06-28T06:52:37","date_gmt":"2021-06-28T06:52:37","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=5162"},"modified":"2021-06-28T06:54:18","modified_gmt":"2021-06-28T06:54:18","slug":"macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/","title":{"rendered":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data"},"content":{"rendered":"<h1>MaCoCu<\/h1>\n<p>The aim of MaCoCu, a <a href=\"https:\/\/ec.europa.eu\/inea\/en\/connecting-europe-facility\">CEF<\/a>-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. The collection of monolingual data is performed by <a href=\"https:\/\/www.ijs.si\/ijsw\/JSI\">Jo\u017eef Stefan Institute<\/a>, Ljubljana, Slovenia.<\/p>\n<h2>Web crawling<\/h2>\n<p>We run a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Web_crawler\">web crawler<\/a> to download texts from the Web. The software we use is <a href=\"http:\/\/corpus.tools\/wiki\/SpiderLing\">SpiderLing<\/a>, developed by the\u00a0<a href=\"https:\/\/nlp.fi.muni.cz\/en\/NLPCentre\">Natural Language Processing Centre<\/a> at Masaryk University, Czech Republic.<\/p>\n<h2>What do we do with the downloaded data?<\/h2>\n<p>We are interested in language use rather than the content of the downloaded texts. The retrieved text will be cleaned, de-duplicated and annotated with text type information. 
<a href=\"https:\/\/en.wikipedia.org\/wiki\/Text_corpus\">Text corpora<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Computational_linguistics\">computational linguistics research<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\">language models<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> tasks will be built using the data.<\/p>\n<h2>What if I don&#8217;t want my website to be crawled?<\/h2>\n<p>Our crawler adheres to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robots_exclusion_standard\">Robots exclusion standard<\/a>. You can restrict access to some or all of the pages on your website by creating a robots.txt file. The user-agent identification of our crawler is <tt>MaCoCu<\/tt>. To prevent our crawler from crawling your website, include the following in your robots.txt:<\/p>\n<pre>User-agent: MaCoCu\r\nDisallow: \/<\/pre>\n<p>Please note that the crawler reads your robots.txt the first time it accesses your site, so any changes will take effect the next time the crawler is run, not immediately.<\/p>\n<h2>Contacts<\/h2>\n<ul>\n<li>V\u00edt Suchomel, vit.suchomel at sketchengine dot eu<\/li>\n<li>Nikola Ljube\u0161i\u0107, nljubesi at gmail dot com<\/li>\n<\/ul>\n<div id=\"themify_builder_content-5162\" data-postid=\"5162\" class=\"themify_builder_content themify_builder_content-5162 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-5162","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=5162"}],"version-history":[{"count":3,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162\/revisions"}],"predecessor-version":[{"id":5165,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162\/revisions\/5165"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=5162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}