{"id":6394,"date":"2023-06-23T13:53:59","date_gmt":"2023-06-23T13:53:59","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?p=6394"},"modified":"2023-06-26T08:03:58","modified_gmt":"2023-06-26T08:03:58","slug":"new-classla-web-corpora-and-tutorial-on-usage-of-the-corpora-via-clarin-si-concordancers","status":"publish","type":"post","link":"https:\/\/www.clarin.si\/info\/new-classla-web-corpora-and-tutorial-on-usage-of-the-corpora-via-clarin-si-concordancers\/","title":{"rendered":"New CLASSLA web corpora and tutorial on usage of the corpora via CLARIN.SI concordancers"},"content":{"rendered":"<p>We are pleased to announce that pilot versions (v0.1) of the CLASSLA-web corpora are now available within the CLASSLA Knowledge Center. The corpora include <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\">Croatian<\/a>\u00a0(2.3 billion words),\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\">Serbian<\/a>\u00a0(2.4 billion words) and\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\">Slovenian<\/a>\u00a0(1.9 billion words).<\/p>\n<p>In addition to the new corpora, a <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\">tutorial on the usage of CLASSLA-web corpora through the CLARIN.SI concordancers<\/a> has been published.<\/p>\n<p>You can read more about the novelties in the CLASSLA Knowledge Center below.<\/p>\n<p><!--more--><\/p>\n<hr \/>\n<p><strong>CLASSLA web corpora of Croatian, Serbian and Slovenian<\/strong><\/p>\n<p>We are\u00a0delighted to announce the release of the pilot versions (v0.1) of the CLASSLA web corpora for\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\">Croatian<\/a>\u00a0(2.3 
billion words),\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\">Serbian<\/a>\u00a0(2.4 billion words) and\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\">Slovenian<\/a>\u00a0(1.9 billion words). The main features of the newly released corpora, aside from their massive size and recency (crawled in 2022), are their\u00a0<a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\">automatic enrichment with genre information<\/a>\u00a0and their linguistic processing with the improved\u00a0<a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza annotation pipeline<\/a>\u00a0(applied version to be released soon). The corpora are available for search via the CLARIN.SI concordancers,\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\">Crystal NoSketchEngine<\/a>,\u00a0<a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener\">Bonito NoSketchEngine<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.clarin.si\/kontext\/corpora\/corplist\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. The pilot versions of these corpora are intended to gather valuable user feedback, while the official release (v1.0) of the three existing corpora, along with web corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is scheduled for later this year.<\/p>\n<p>We warmly welcome you to explore our corpora. Please reach out to us at\u00a0<a href=\"mailto:helpdesk.classla@clarin.si\">helpdesk.classla@clarin.si<\/a>\u00a0with any ideas for improvements; we will try hard to implement them already in the upcoming official release! 
We also encourage you to share with us how you plan to use these corpora in your research, as well as any other use cases you may have in mind.<\/p>\n<p>To give you some ideas on how the corpora can be used in your research, you are invited to read\u00a0<a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\" target=\"_blank\" rel=\"noopener\">our blog post on the use of CLASSLA web corpora via the open CLARIN.SI concordancers<\/a>. The step-by-step tutorial covers a wide range of functionalities of the concordancers, including finding collocations in different genres, analyzing word statistics, and exploring the use of non-standard words. This resource is particularly suited for linguists, language teachers and digital humanists.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are pleased to announce that pilot versions (v0.1) of the CLASSLA-web corpora are now available within the CLASSLA Knowledge Center. The corpora include Croatian\u00a0(2.3 billion words),\u00a0Serbian\u00a0(2.4 billion words) and\u00a0Slovenian\u00a0(1.9 billion words). In addition to the new corpora, a tutorial on the usage of CLASSLA-web corpora through the CLARIN.SI concordancers has been published. 
You can [&hellip;]<\/p>\n","protected":false},"author":12,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[34],"tags":[],"class_list":["post-6394","post","type-post","status-publish","format-standard","hentry","category-events","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/posts\/6394","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=6394"}],"version-history":[{"count":5,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/posts\/6394\/revisions"}],"predecessor-version":[{"id":6414,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/posts\/6394\/revisions\/6414"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=6394"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/categories?post=6394"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/tags?post=6394"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}