{"id":6801,"date":"2023-12-05T10:34:04","date_gmt":"2023-12-05T10:34:04","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=6801"},"modified":"2023-12-05T12:53:09","modified_gmt":"2023-12-05T12:53:09","slug":"comparable-classla-web-corpora-of-south-slavic-languages","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/comparable-classla-web-corpora-of-south-slavic-languages\/","title":{"rendered":"Comparable CLASSLA web corpora of South Slavic languages"},"content":{"rendered":"<p><strong>An introduction to the comparable CLASSLA web corpora for South Slavic languages, providing details on the corpora sizes and interesting insights based on genre distributions.<\/strong><\/p>\n<pre>Nikola Ljube\u0161i\u0107 and Taja Kuzman \u00b7 December 5, 2023 \u00b7 3-minutes read<\/pre>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA Knowledge centre for South Slavic languages<\/span><\/a><span style=\"font-weight: 400;\"> has released comparable web corpora<\/span><span style=\"font-weight: 400;\"> for all official South Slavic languages, namely <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovenian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Croatian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_bs\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bosnian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_cnr\" target=\"_blank\" 
rel=\"noopener\"><span style=\"font-weight: 400;\">Montenegrin<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Serbian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_mk\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Macedonian<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_bg\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bulgarian<\/span><\/a><span style=\"font-weight: 400;\">, which together amount to almost 11 billion words! The corpora are freely available on the CLARIN.SI <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">NoSketch Engine<\/span><\/a><span style=\"font-weight: 400;\"> concordancer (see our recent <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">tutorial on how to easily query the CLASSLA web corpora and perform statistical analyses via the concordancer<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The size of each corpus, in terms of the number of tokens, words and documents, is given in the table below.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-6822 aligncenter\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1-300x113.png\" alt=\"\" width=\"602\" height=\"227\" 
srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1-300x113.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1-1024x387.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1-768x290.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-sizes-1.png 1095w\" sizes=\"auto, (max-width: 602px) 100vw, 602px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">This collection of corpora is innovative for the following reasons:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">This is, to the best of our knowledge, the first collection of comparable web corpora covering a whole language group.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The collection includes the first general, linguistically annotated corpora for two of the seven languages, namely Montenegrin and Macedonian.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The comparability of the corpora is ensured by performing data collection and filtering in the same time period and with the same technologies. 
Furthermore, the corpora underwent uniform linguistic processing with the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Stanza<\/span><\/a><span style=\"font-weight: 400;\"> toolkit, which you can now also try out through the <\/span><a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA annotator web interface<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li aria-level=\"1\">Each of the documents in each of the corpora is annotated with the <a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\">X-GENRE multilingual genre classifier<\/a>. The normalized distributions of genre labels in the CLASSLA web corpora are presented in the following figure.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-6823 aligncenter\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1-300x153.png\" alt=\"\" width=\"666\" height=\"340\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1-300x153.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1-1024x521.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1-768x391.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/12\/CLASSLA-web-corpora-distribution-1.png 1410w\" sizes=\"auto, (max-width: 666px) 100vw, 666px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">The comparison of genre distributions across the CLASSLA web corpora shows the limits of the comparability of web crawls performed even 
in neighboring countries with closely related languages. Thanks to the automatically annotated genre information, the differences between the corpora can be mitigated by controlling for the genre distribution. Looking into the reasons for such strong differences between the genre distributions, one can already see from the figure that the news genre on one side and the promotion genre on the other are the main drivers of the differences across these seven languages, and that the two are, on top of that, strongly negatively correlated with each other. Hypothesizing that the amount of promotional material on a national web corresponds to the amount of economic activity, we ran a preliminary investigation of how <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_countries_by_GDP_(PPP)_per_capita\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">GDP PPP per capita<\/span><\/a><span style=\"font-weight: 400;\"> across the seven countries correlates with the shares of the promotion and the news genres. Calculating the Pearson correlation, we obtain a very high positive correlation between GDP PPP per capita and the promotion genre (r=.938, N=7, p=.002), as well as a very high negative correlation between GDP PPP per capita and the news genre (r=-.9, N=7, p=.006). This is only a small example of the insights that can be obtained from having genre information on each of the 26 million documents. We are very excited to see all the interesting research that will be performed on the CLASSLA web corpora!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We would be very glad to receive feedback on our corpora and annotation technology. 
As usual, please write to us at <\/span><a href=\"mailto:helpdesk.classla@clarin.si\"><span style=\"font-weight: 400;\">helpdesk.classla@clarin.si<\/span><\/a><span style=\"font-weight: 400;\">!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These corpora could not have been released without the great collaboration within the CLASSLA Knowledge centre for South Slavic languages, which includes the Slovenian consortium <\/span><a href=\"https:\/\/www.clarin.si\/info\/about\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI<\/span><\/a><span style=\"font-weight: 400;\">, the <\/span><a href=\"http:\/\/ihjj.hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Institute of Croatian Language<\/span><\/a><span style=\"font-weight: 400;\">, and the Bulgarian consortium <\/span><a href=\"https:\/\/clada-bg.eu\/en\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLADA-BG<\/span><\/a><span style=\"font-weight: 400;\">. Equally crucial was the longstanding collaboration with the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI centre<\/span><\/a><span style=\"font-weight: 400;\"> on a series of South Slavic languages, and with Biljana Stojanovska and Katerina Zdravkova on Macedonian. We would like to take this occasion to thank everyone for their collaboration, and to invite others to join our common efforts!<\/span><\/p>\n<!--themify_builder_content-->\n<div id=\"themify_builder_content-6801\" data-postid=\"6801\" class=\"themify_builder_content themify_builder_content-6801 themify_builder tf_clear\">\n    <\/div>\n<!--\/themify_builder_content-->\n","protected":false},"excerpt":{"rendered":"<p>An introduction to the comparable CLASSLA web corpora for South Slavic languages, providing details on the corpora sizes and interesting insights based on genre distributions. 
Nikola Ljube\u0161i\u0107 and Taja Kuzman \u00b7 December 5, 2023 \u00b7 3-minutes read &nbsp; The CLASSLA Knowledge centre for South Slavic languages has released comparable web corpora for all official South [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-6801","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=6801"}],"version-history":[{"count":18,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6801\/revisions"}],"predecessor-version":[{"id":6827,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6801\/revisions\/6827"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=6801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}