{"id":5162,"date":"2021-06-28T06:52:37","date_gmt":"2021-06-28T06:52:37","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=5162"},"modified":"2021-06-28T06:54:18","modified_gmt":"2021-06-28T06:54:18","slug":"macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/","title":{"rendered":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data"},"content":{"rendered":"<h1>MaCoCu<\/h1>\n<p>The aim of MaCoCu, a <a href=\"https:\/\/ec.europa.eu\/inea\/en\/connecting-europe-facility\">CEF<\/a>-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. The collection of monolingual data is performed by <a href=\"https:\/\/www.ijs.si\/ijsw\/JSI\">Jo\u017eef Stefan Institute<\/a>, Ljubljana, Slovenia.<\/p>\n<h2>Web crawling<\/h2>\n<p>We run a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Web_crawler\">web crawler<\/a> to download texts from the Web. The software we use is <a href=\"http:\/\/corpus.tools\/wiki\/SpiderLing\">SpiderLing<\/a>, developed by the\u00a0<a href=\"https:\/\/nlp.fi.muni.cz\/en\/NLPCentre\">Natural Language Processing Centre<\/a> at Masaryk University, Czech Republic.<\/p>\n<h2>What do we do with the downloaded data?<\/h2>\n<p>We are interested in language use rather than the content of the downloaded texts. The retrieved text will be cleaned, de-duplicated and annotated with text type information. 
<a href=\"https:\/\/en.wikipedia.org\/wiki\/Text_corpus\">Text corpora<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Computational_linguistics\">computational linguistics research<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\">language models<\/a> for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> tasks will be built using the data.<\/p>\n<h2>What if I don&#8217;t want my website to be crawled?<\/h2>\n<p>Our crawler adheres to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robots_exclusion_standard\">Robots exclusion standard<\/a>. You can restrict access to some or all of the pages on your website by creating a robots.txt file. The user-agent identification of our crawler is <tt>MaCoCu<\/tt>. This is what to include in your robots.txt if you want to prevent our crawler from crawling your website:<\/p>\n<pre>User-agent: MaCoCu\r\nDisallow: \/<\/pre>\n<p>Please note that the crawler reads your robots.txt the first time it accesses your site, so any changes will take effect the next time the crawler is run, not immediately.<\/p>\n<h2>Contacts<\/h2>\n<ul>\n<li>V\u00edt Suchomel, vit.suchomel at sketchengine dot eu<\/li>\n<li>Nikola Ljube\u0161i\u0107, nljubesi at gmail dot com<\/li>\n<\/ul>\n<!--themify_builder_content-->\n<div id=\"themify_builder_content-5162\" data-postid=\"5162\" class=\"themify_builder_content themify_builder_content-5162 themify_builder tf_clear\">\n    <\/div>\n<!--\/themify_builder_content-->\n","protected":false},"excerpt":{"rendered":"<p>MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"class_list":["post-5162","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"aioseo_notices":[],"aioseo_head":"\n\t\t<!-- All in One SEO 4.9.8 - aioseo.com -->\n\t<meta name=\"description\" content=\"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan\" \/>\n\t<meta name=\"robots\" content=\"max-image-preview:large\" \/>\n\t<meta name=\"google-site-verification\" content=\"LiA10aq97L10baWhrk27m-8KV46nP_6qo6Z8pFmPF88\" \/>\n\t<link rel=\"canonical\" href=\"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/\" \/>\n\t<meta name=\"generator\" content=\"All in One SEO (AIOSEO) 4.9.8\" \/>\n\t\t<meta property=\"og:locale\" content=\"en_GB\" \/>\n\t\t<meta property=\"og:site_name\" content=\"CLARIN Slovenija - Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije\" \/>\n\t\t<meta property=\"og:type\" content=\"article\" \/>\n\t\t<meta property=\"og:title\" content=\"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija\" \/>\n\t\t<meta property=\"og:description\" content=\"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan\" \/>\n\t\t<meta property=\"og:url\" content=\"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/\" \/>\n\t\t<meta property=\"article:published_time\" content=\"2021-06-28T06:52:37+00:00\" \/>\n\t\t<meta property=\"article:modified_time\" content=\"2021-06-28T06:54:18+00:00\" \/>\n\t\t<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n\t\t<meta name=\"twitter:title\" content=\"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija\" \/>\n\t\t<meta name=\"twitter:description\" content=\"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. The collection of monolingual data is performed by Jo\u017eef Stefan\" \/>\n\t\t<script type=\"application\/ld+json\" class=\"aioseo-schema\">\n\t\t\t{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#breadcrumblist\",\"itemListElement\":[{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info#listItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.clarin.si\\\/info\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#listItem\",\"name\":\"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual 
Data\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#listItem\",\"position\":2,\"name\":\"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data\",\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info#listItem\",\"name\":\"Home\"}}]},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#organization\",\"name\":\"CLARIN Slovenija\",\"description\":\"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/wp-content\\\/uploads\\\/2014\\\/08\\\/Clarin-SI-logo.png\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#organizationLogo\",\"width\":359,\"height\":150},\"image\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#organizationLogo\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#webpage\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/\",\"name\":\"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija\",\"description\":\"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\\u017eef Stefan\",\"inLanguage\":\"en-GB\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#website\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\\\/#breadcrumblist\"},\"datePublished\":\"2021-06-28T06:52:37+00:00\",\"dateModified\":\"2021-06-28T06:54:18+00:00\"},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#website\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/\",\"name\":\"CLARIN Slovenija\",\"description\":\"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije\",\"inLanguage\":\"en-GB\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#organization\"}}]}\n\t\t<\/script>\n\t\t<!-- All in One SEO -->\n\n","aioseo_head_json":{"title":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija","description":"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan","canonical_url":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/","robots":"max-image-preview:large","keywords":"","webmasterTools":{"google-site-verification":"LiA10aq97L10baWhrk27m-8KV46nP_6qo6Z8pFmPF88","miscellaneous":""},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"BreadcrumbList","@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#breadcrumblist","itemListElement":[{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info#listItem","position":1,"name":"Home","item":"https:\/\/www.clarin.si\/info","nextItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#listItem","name":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data"}},{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#listItem","position":2,"name":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data","previousItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info#listItem","name":"Home"}}]},{"@type":"Organization","@id":"https:\/\/www.clarin.si\/info\/#organization","name":"CLARIN Slovenija","description":"Slovenska raziskovalna infrastruktura za jezikovne vire in 
tehnologije","url":"https:\/\/www.clarin.si\/info\/","logo":{"@type":"ImageObject","url":"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2014\/08\/Clarin-SI-logo.png","@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#organizationLogo","width":359,"height":150},"image":{"@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#organizationLogo"}},{"@type":"WebPage","@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#webpage","url":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/","name":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija","description":"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan","inLanguage":"en-GB","isPartOf":{"@id":"https:\/\/www.clarin.si\/info\/#website"},"breadcrumb":{"@id":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/#breadcrumblist"},"datePublished":"2021-06-28T06:52:37+00:00","dateModified":"2021-06-28T06:54:18+00:00"},{"@type":"WebSite","@id":"https:\/\/www.clarin.si\/info\/#website","url":"https:\/\/www.clarin.si\/info\/","name":"CLARIN Slovenija","description":"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije","inLanguage":"en-GB","publisher":{"@id":"https:\/\/www.clarin.si\/info\/#organization"}}]},"og:locale":"en_GB","og:site_name":"CLARIN Slovenija - Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije","og:type":"article","og:title":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija","og:description":"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. 
The collection of monolingual data is performed by Jo\u017eef Stefan","og:url":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/","article:published_time":"2021-06-28T06:52:37+00:00","article:modified_time":"2021-06-28T06:54:18+00:00","twitter:card":"summary_large_image","twitter:title":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data - CLARIN Slovenija","twitter:description":"MaCoCu The aim of MaCoCu, a CEF-funded project, is to collect, curate and enrich monolingual and parallel data from the Internet for 12 under-resourced languages of EU member states and candidate states: Albanian, Bosnian, Bulgarian, Croatian, Greek, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian, and Turkish. The collection of monolingual data is performed by Jo\u017eef Stefan"},"aioseo_meta_data":{"post_id":"5162","title":null,"description":null,"keywords":null,"keyphrases":null,"primary_term":null,"canonical_url":null,"og_title":null,"og_description":null,"og_object_type":"default","og_image_type":"default","og_image_custom_url":null,"og_image_custom_fields":null,"og_image_url":null,"og_image_width":null,"og_image_height":null,"og_video":null,"og_custom_url":null,"og_article_section":null,"og_article_tags":null,"twitter_use_og":false,"twitter_card":"default","twitter_image_type":"default","twitter_image_custom_url":null,"twitter_image_custom_fields":null,"twitter_image_url":null,"twitter_title":null,"twitter_description":null,"schema_type":"default","schema_type_options":null,"schema":{"blockGraphs":[],"customGraphs":[],"default":{"data":{"Article":[],"Course":[],"Dataset":[],"FAQPage":[],"Movie":[],"Person":[],"Product":[],"ProductReview":[],"Car":[],"Recipe":[],"Service":[],"SoftwareApplication":[],"WebPage":[]},"graphName":"","isEnabled":true},"graphs":[]},"pillar_content":false,"robots_default":true,"robots_noindex":false,"robots_noarchive":false,"robots_nosnippet":false,"robots_nofollow"
:false,"robots_noimageindex":false,"robots_noodp":false,"robots_notranslate":false,"robots_max_snippet":null,"robots_max_videopreview":null,"robots_max_imagepreview":"large","priority":null,"frequency":null,"local_seo":null,"limit_modified_date":false,"ai":null,"breadcrumb_settings":null,"seo_analyzer_scan_date":null,"created":"2026-05-21 09:25:56","updated":"2026-05-21 09:25:56"},"aioseo_breadcrumb":"<div class=\"aioseo-breadcrumbs\"><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.clarin.si\/info\" title=\"Home\">Home<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\tMaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data\n\t\t<\/span><\/div>","aioseo_breadcrumb_json":[{"label":"Home","link":"https:\/\/www.clarin.si\/info"},{"label":"MaCoCu: Massive Collection and Curation of Monolingual and Bilingual Data","link":"https:\/\/www.clarin.si\/info\/macocu-massive-collection-and-curation-of-monolingual-and-bilingual-data\/"}],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=5162"}],"version-history":[{"count":3,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162\/revisions"}],"predecessor-version":[{"id":5165,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5162\/revisions\/5165"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=5162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}