{"id":6349,"date":"2023-06-22T10:38:46","date_gmt":"2023-06-22T10:38:46","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=6349"},"modified":"2023-06-22T11:15:16","modified_gmt":"2023-06-22T11:15:16","slug":"classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/","title":{"rendered":"CLASSLA-web: Bigger and Better Web Corpora for Croatian, Serbian and Slovenian on CLARIN.SI Concordancers"},"content":{"rendered":"<p><strong>A tutorial on how to use a CLARIN.SI tool for easy querying and statistical analysis of text collections, especially appropriate for linguists, language teachers, digital humanists, corpus linguists and others. We show you how you can query new massive web text collections for Croatian, Slovenian and Serbian and find collocations, word statistics, context of non-standard words and more.<\/strong><\/p>\n<pre>Taja Kuzman and Nikola Ljube\u0161i\u0107 \u00b7 June 22, 2023 \u00b7 10-minutes read<\/pre>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">CLASSLA-web are new massive web corpora for Slovenian (<\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sl<\/span><\/a><span style=\"font-weight: 400;\">), Croatian (<\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.hr<\/span><\/a><span style=\"font-weight: 400;\">) and Serbian (<\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sr<\/span><\/a><span style=\"font-weight: 400;\">). 
These are pilot corpora for the web corpora of all South Slavic languages that will be published later this year by the <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA Knowledge Centre for South Slavic languages<\/span><\/a><span style=\"font-weight: 400;\">. In this blog post, we will show you how you can query them on the <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI concordancers<\/span><\/a><span style=\"font-weight: 400;\"> to explore word usages, collocations, good dictionary examples, differences in usage in different genres and more.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Concordancers are computer programs that enable effortless searching and statistical treatment of data in big text collections (= corpora) also for those that are less tech-savvy. The Slovenian research infrastructure<\/span> <a href=\"https:\/\/www.clarin.si\/info\/about\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI<\/span><\/a><span style=\"font-weight: 400;\"> provides three open concordancers, and you can access the CLASSLA web corpora on all three of them (read more <\/span><a href=\"https:\/\/www.clarin.si\/info\/concordances\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\"> on how they differ). 
The examples in this post are from the<\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"> <span style=\"font-weight: 400;\">CLARIN.SI noSketchEngine<\/span><\/a><span style=\"font-weight: 400;\"> concordancer, an open-source version of the commercial <\/span><a href=\"https:\/\/www.sketchengine.co.uk\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Sketch Engine<\/span><\/a><span style=\"font-weight: 400;\">, which was developed by<\/span> <a href=\"https:\/\/www.lexicalcomputing.com\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Lexical Computing<\/span><\/a><span style=\"font-weight: 400;\">. The corpora are available for querying here: <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sl<\/span><\/a><span style=\"font-weight: 400;\">,<\/span> <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.hr<\/span><\/a><span style=\"font-weight: 400;\"> and<\/span> <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sr<\/span><\/a><span style=\"font-weight: 400;\">. Feel free to open them and try to reproduce the steps in the blog post, while you are reading it, and then experiment some more on your own. We are very interested in your experience with the corpora. 
Let us know if you noticed anything that you liked or disliked via <\/span><a href=\"mailto:helpdesk.classla@clarin.si\"><span style=\"font-weight: 400;\">helpdesk.classla@clarin.si<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><b>Introducing the CLASSLA web corpora<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">As part of the\u00a0<\/span><a href=\"https:\/\/macocu.eu\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu<\/span><\/a><span style=\"font-weight: 400;\"> project, we collected monolingual and parallel web corpora for more than 10 European under-resourced languages and made them freely available (you can download them at <\/span><a href=\"https:\/\/macocu.eu\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/macocu.eu\/<\/span><\/a><span style=\"font-weight: 400;\">). They can be used for training machine translation systems, language models and other language technologies. To make the datasets more usable for linguists and corpus linguists, the<\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/\" target=\"_blank\" rel=\"noopener\"> <span style=\"font-weight: 400;\">CLASSLA<\/span><\/a><span style=\"font-weight: 400;\"> team linguistically annotated the monolingual datasets for Slovenian, Serbian and Croatian with the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">state-of-the-art pipeline for linguistic annotation CLASSLA Stanza<\/span><\/a><span style=\"font-weight: 400;\">. Additionally, we enriched them with metadata on genre with the <\/span><a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">multilingual genre classifier X-GENRE<\/span><\/a><span style=\"font-weight: 400;\">. 
Finally, we made them freely available at the<\/span> <a href=\"https:\/\/www.clarin.si\/info\/concordances\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI concordancers<\/span><\/a><span style=\"font-weight: 400;\">. In this blog post, we will show you what kinds of insights linguists and digital humanists can obtain from the CLASSLA web corpora with just a few clicks.<\/span><\/p>\n<h4><b>But first, some information on the corpora<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">The CLASSLA web corpora come from the<\/span> <a href=\"http:\/\/hdl.handle.net\/11356\/1795\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu-sl<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1806\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu-hr<\/span><\/a><span style=\"font-weight: 400;\"> and<\/span> <a href=\"http:\/\/hdl.handle.net\/11356\/1807\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu-sr<\/span><\/a><span style=\"font-weight: 400;\"> monolingual corpora. They were collected primarily by crawling the national top-level internet domains, that is, \u201c.si\u201d for Slovenian, \u201c.hr\u201d for Croatian and \u201c.rs\u201d and \u201c.\u0441\u0440\u0431\u201d for Serbian, but also by crawling generic domains (\u201c.com\u201d, \u201c.net\u201d, etc.) that are well connected to websites on the respective top-level domains. 
You can find more information on the corpora collection and curation at<\/span> <a href=\"https:\/\/macocu.eu\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/macocu.eu\/<\/span><\/a><span style=\"font-weight: 400;\"> and at the dataset entries on the CLARIN.SI repository (<\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1795\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu-sl<\/span><\/a><span style=\"font-weight: 400;\">,<\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1806\" target=\"_blank\" rel=\"noopener\"> <span style=\"font-weight: 400;\">MaCoCu-hr<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1807\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">MaCoCu-sr<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Each CLASSLA web corpus comprises around 2 billion words and 6 to 8 million texts. The Slovenian and Croatian CLASSLA web corpora are roughly two times bigger than the previous web corpora for these two languages, the <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=slwac\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">slWaC<\/span><\/a><span style=\"font-weight: 400;\"> and the<\/span> <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=hrwac\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">hrWaC<\/span><\/a><span style=\"font-weight: 400;\">, while the Serbian corpus is even 5 times bigger than the previous web corpus, the<\/span>\u00a0<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=srwac\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">srWaC<\/span><\/a><span style=\"font-weight: 400;\">!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition to being much bigger, the CLASSLA web corpora are much more recent, which means that they include terms that 
have only recently been introduced, such as \u201cCOVID-19\u201d and \u201cself-isolation\u201d. They include texts published up to 2021 (CLASSLA-web.sl) or even 2022 (CLASSLA-web.hr and CLASSLA-web.sr). They also provide rich metadata, including information on:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">the web domain and URL of the text<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">the genre of the text<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">language identification information<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">script (Latin or Cyrillic, for Serbian only)<\/span><\/li>\n<\/ul>\n<h3><b>Obtaining context for non-standard words<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The web corpora are not limited to texts written by professional writers in a formal context. They aim to encompass texts of all kinds written on the web, and therefore also include texts from forums and personal blogs. As a result, they can provide valuable insight into dialectal and non-standard use of language, and you can find information on the usage of words that are not included in a general dictionary.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For instance, the Slovenian word \u201cmetek\u201d (\u201cbullet\u201d) is shunned by Slovenian linguists and copy-editors because it is thought to\u00a0<\/span><a href=\"https:\/\/fran.si\/193\/marko-snoj-slovenski-etimoloski-slovar\/4288797\/mtek?View=1&amp;Query=metek\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">be a loan word from Croatian and Serbian<\/span><\/a><span style=\"font-weight: 400;\">. It is not included in general Slovenian dictionaries, and copy-editors will replace it with the Slovenian equivalents \u201cizstrelek\u201d or \u201ckrogla\u201d. 
However, native Slovenian speakers still feel that this word is a legitimate Slovenian word and we can explore how they use the word on the web in CLASSLA-web.sl.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We can find the sentences that contain the word (= concordances) by clicking on the button CONCORDANCE on the dashboard or on the icon for the concordance in the menu bar on the left.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6350\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic1-300x134.png\" alt=\"\" width=\"871\" height=\"389\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic1-300x134.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic1-768x342.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic1.png 1000w\" sizes=\"auto, (max-width: 871px) 100vw, 871px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">We simply input the word and press SEARCH.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6351\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic2-300x90.png\" alt=\"\" width=\"877\" height=\"263\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic2-300x90.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic2-768x230.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic2.png 1000w\" sizes=\"auto, (max-width: 877px) 100vw, 877px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">We obtain a list of concordances with the word \u201cmetek\u201d \u200a\u2014 \u200athere are around 2000 sentences in the Slovenian web corpus with this word.<\/span><\/p>\n<p><a 
href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6352\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic3-300x114.png\" alt=\"\" width=\"850\" height=\"323\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic3-300x114.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic3-768x291.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic3.png 1000w\" sizes=\"auto, (max-width: 850px) 100vw, 850px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">We can inspect in which text types (genres) the word occurs the most by clicking on the icon for Frequency. <\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6353\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic4-300x22.png\" alt=\"\" width=\"832\" height=\"61\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic4-300x22.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic4-768x56.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic4.png 1000w\" sizes=\"auto, (max-width: 832px) 100vw, 832px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Click on the button TEXT TYPES to get more information on the occurrence of the word in specific genres, presented below:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6354\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic5-300x109.png\" alt=\"\" width=\"864\" height=\"314\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic5-300x109.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic5-768x279.png 
768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic5.png 1000w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">We can see that the word \u201cmetek\u201d is most frequently used in Forums and Opinionated texts, where people express themselves freely. Surprisingly, however, we can also find this word in News and even in Legal texts. By clicking on the three dots on the right, you can inspect the concordances inside a specific genre.<\/span><\/p>\n<h3><b>Searching for collocations<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">As the CLASSLA web corpora were collected in 2021 or 2022, they also provide information on recent words, such as those connected with the COVID-19 pandemic. We can obtain information on how these words are used in context and with which words they often co-occur.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For instance, to see which words often co-occur with the Croatian word \u201ckarantena\u201d (\u201cquarantine\u201d), we first search for the word in the Concordance window inside the CLASSLA-web.hr corpus, as we did with \u201cmetek\u201d. Once the window with the concordances opens, choose the button for Collocations:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic6.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6355\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic6-300x27.png\" alt=\"\" width=\"611\" height=\"55\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic6-300x27.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic6.png 716w\" sizes=\"auto, (max-width: 611px) 100vw, 611px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">You can specify whether you are searching for words on the left or the right side of your word and how far from the word they appear. 
You can also choose whether you want the list to show words (word forms as they appear in the text), lemmas (base, dictionary forms of words), part-of-speech tags (to investigate whether this word is surrounded by verbs or nouns more often), and so on.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic7.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6356\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic7-300x122.png\" alt=\"\" width=\"792\" height=\"322\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic7-300x122.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic7-768x312.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic7.png 959w\" sizes=\"auto, (max-width: 792px) 100vw, 792px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Below, we see the results\u200a\u2014\u200awe can see that the Croatian word \u201ckarantena\u201d (\u201cquarantine\u201d) co-occurs often with the word \u201csamoizolacija\u201d (\u201cself-isolation\u201d), with the adjectives describing the length of the quarantine (14-days, 2-week) and other related words.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic8.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6357\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic8-300x137.png\" alt=\"\" width=\"797\" height=\"364\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic8-300x137.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic8.png 1000w\" sizes=\"auto, (max-width: 797px) 100vw, 797px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Now, what if we want to find only verbs that occur with the word \u201cquarantine\u201d? We can do that by using the advanced search for the concordance. 
As shown in the example below, we specify that we are searching for two words: the first needs to be a verb ([pos=\"VERB\"]), and the second is the word \u201ckarantena\u201d in any form ([lemma=\"karantena\"]). Note that CQL expects straight double quotes around the values. If you are not familiar with this format, the CQL BUILDER can lead you through the process of creating such a query.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.0.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6358\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.0-300x158.png\" alt=\"\" width=\"769\" height=\"405\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.0-300x158.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.0-768x403.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.0.png 1000w\" sizes=\"auto, (max-width: 769px) 100vw, 769px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">A list of concordances with this phrase appears. 
To get a list of the most frequent combinations of verbs and the word \u201ckarantena\u201d, click on the button for Frequency and choose the option \u201cLEMMAS\u201d for KWIC (Key Word in Context).<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6359\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.1-300x122.png\" alt=\"\" width=\"770\" height=\"313\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.1-300x122.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.1-768x313.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.1.png 1000w\" sizes=\"auto, (max-width: 770px) 100vw, 770px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">A list with hits with the words in their base, dictionary form appears. By clicking on the three dots on the left, you can inspect examples for a specific phrase. 
We can see that the most frequent verbs that precede the word \u201ckarantena\u201d (\u201cquarantine\u201d) are \u201cuvesti\u201d (\u201cimpose\u201d), \u201cproglasiti\u201d (\u201cannounce\u201d), \u201cuvoditi\u201d (\u201cintroduce\u201d), \u201cpro\u0107i\u201d (\u201cget through\u201d), and \u201cizbje\u0107i\u201d (\u201cavoid\u201d).<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6360\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.2-300x130.png\" alt=\"\" width=\"734\" height=\"318\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.2-300x130.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.2-768x332.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.2.png 1000w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/a><\/p>\n<h3><b>Obtaining the most informative examples<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The NoSketch Engine concordancer also provides an option to automatically identify sentences which are easy to understand and especially illustrative for language learners. 
If you are curious about how to use the Serbian word \u201csamoizolacija\u201d (\u201cself-isolation\u201d), you can search for the concordances with this word in the CLASSLA-web.sr corpus and then click on the Good Dictionary Examples (GDEX) icon:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6361\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.3-300x101.png\" alt=\"\" width=\"808\" height=\"272\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.3-300x101.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.3-768x259.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.3.png 1000w\" sizes=\"auto, (max-width: 808px) 100vw, 808px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">The tool provides nice examples of how the word is used in sentences in combination with common collocations and in common word forms:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6362\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.4-300x76.png\" alt=\"\" width=\"805\" height=\"204\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.4-300x76.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.4-768x194.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.4.png 1000w\" sizes=\"auto, (max-width: 805px) 100vw, 805px\" \/><\/a><\/p>\n<h3><b>Genre in the CLASSLA web corpora<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The genre of the texts was automatically identified with the multilingual genre classifier<\/span> <a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" 
rel=\"noopener\"><span style=\"font-weight: 400;\">X-GENRE<\/span><\/a><span style=\"font-weight: 400;\">. Genre is a phenomenon observed at the text level, so if a web text is very short (75 words or fewer), we deemed it unsuitable for reliable genre prediction and annotated it as \u201cShort\u201d instead. If a text has characteristics of multiple genres, it was annotated as \u201cMix\u201d. Genre metadata allow us to easily perform linguistic and sociolinguistic analysis of how certain words or phrases are used in different situational contexts.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For instance, let\u2019s look at how the Croatian word \u201cpas\u201d (\u201cdog\u201d) is used in various text types in the Croatian CLASSLA-web.hr corpus. When searching for the concordances, specify the genre of your choice in the option Text types, as shown below:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6363\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.5-300x162.png\" alt=\"\" width=\"741\" height=\"400\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.5-300x162.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.5-768x415.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.5.png 1000w\" sizes=\"auto, (max-width: 741px) 100vw, 741px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Once the list of concordances appears, click on the Collocations button, as we did with the word \u201ckarantena\u201d, and inspect the lemmas that occur directly to the right and left of the word \u201cpas\u201d:<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/lema-base-form-pas.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6376\" 
src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/lema-base-form-pas-300x136.png\" alt=\"\" width=\"748\" height=\"339\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/lema-base-form-pas-300x136.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/lema-base-form-pas-768x349.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/lema-base-form-pas.png 952w\" sizes=\"auto, (max-width: 748px) 100vw, 748px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">This is how we obtain collocations for the word as used in a specific genre. For instance, the most frequent collocations in texts from Forums are \u201cpas mater\u201d (\u201cdog (does sth. to) mother\u201d), which is part of a Croatian swear word (click on the three dots on the left to find out which), \u201cpas lajati\u201d (\u201cdog bark\u201d) and \u201cdupli pas\u201d (\u201cone-two\u201d, a sports term).<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pas-results.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6378\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pas-results-300x85.png\" alt=\"\" width=\"748\" height=\"212\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pas-results-300x85.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pas-results-768x217.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pas-results.png 906w\" sizes=\"auto, (max-width: 748px) 100vw, 748px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">If we repeat the steps and choose different genres, we can see how the most frequent collocations with this word change depending on the text type. 
Here are some examples from among the most frequent collocations:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Legal: \u201cneupisan pas\u201d (\u201cunregistered dog\u201d), \u201cmikro\u010dipiranje pasa\u201d (\u201cmicrochipping of dogs\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Promotion: \u201c\u0161i\u0161anje pasa\u201d (\u201cdog grooming\u201d), \u201cizbirljiv pas\u201d (\u201cpicky eater\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">News: \u201cpotra\u017eni pas\u201d (\u201csearch dog\u201d), \u201cpas lutalica\u201d (\u201cstray dog\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Opinion\/Argumentation: \u201cbijesan pas\u201d (\u201cmad\/rabid dog\u201d), \u201csusjedov pas\u201d (\u201cneighbour\u2019s dog\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Instruction: \u201cva\u0161 pas\u201d (\u201cyour dog\u201d), \u201cudomiti psa\u201d (\u201cadopt a dog\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Prose\/Poetry: \u201c\u010dovjekoliki pas\u201d (\u201chuman-like dog\u201d), \u201cpas traga\u010d\u201d (\u201cbloodhound\u201d)<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Information\/Explanation: \u201clova\u010dki pas\u201d (\u201chunting dog\u201d), \u201cpastirski pas\u201d (\u201csheepdog\u201d)<\/span><\/li>\n<\/ul>\n<h3><b>Leveraging other metadata<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The Serbian CLASSLA-web.sr corpus also provides information on whether the text was originally written in the Cyrillic or the Latin script.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s search for the phrase \u201cvakcinacija\u201d (\u201cvaccination\u201d) 
in the Serbian corpus. Then click on the Frequency button and choose TEXT TYPES.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.8.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6366\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.8-300x61.png\" alt=\"\" width=\"900\" height=\"183\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.8-300x61.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.8-768x157.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.8.png 1000w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">The Relative Density metric tells us the relative frequency of the phrase in a text type\u200a \u2014 \u200aif it is above 100%, that means that the phrase is more frequent in this text type than in the entire corpus and that it is typical for this text type. 
Based on this metric, we can see that vaccination is more frequently discussed in Cyrillic than in Latin texts.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let us then search for another term, \u201cantivakser\u201d (\u201canti-vaxxer\u201d), and inspect the text type distribution again.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6368\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9-300x29.png\" alt=\"\" width=\"797\" height=\"77\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9-300x29.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9-768x74.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.png 1000w\" sizes=\"auto, (max-width: 797px) 100vw, 797px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">It seems that vaccination is mentioned more frequently in texts written in the Cyrillic script, while people opposing vaccination are discussed more in texts written in the Latin script.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">On the same page, we can also see on which web sites the word occurs most frequently. By clicking on the three dots on the left, we can inspect the instances that come from a specific web site. 
From the statistics for the word \u201cantivakser\u201d (\u201canti-vaxxer\u201d), we can see that this topic is discussed most on forums and blogs.<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6367\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.1-300x80.png\" alt=\"\" width=\"998\" height=\"266\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.1-300x80.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/06\/pic9.9.1-768x204.png 768w\" sizes=\"auto, (max-width: 998px) 100vw, 998px\" \/><\/a><\/p>\n<h3><b>Conclusion<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">To sum up, in this blog post, we have shown:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">how you can search for any word or phrase in massive web corpora of Slovenian, Serbian and Croatian. 
Since these text collections also include texts that were not written by professional writers, we can also find slang and dialectal words, as well as other non-standard language features that might not be present in dictionaries and standard text collections;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">how you can search for collocations of words using the Collocations feature and even filter them by genre;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">how you can search for more specific collocations, such as a phrase that consists of a verb that precedes a certain noun, by using the Advanced search option and the Frequency feature;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">how you can analyse the frequency of a word or a phrase in different genres and web sites and, for the Serbian corpus, in different scripts;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">how you can obtain the most informative examples of word usage, which can especially help language learners and teachers.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We hope that this tutorial has inspired you to use the CLASSLA web corpora for your work, research or other uses. 
You can access them for free from these links:<\/span> <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sl<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.hr<\/span><\/a><span style=\"font-weight: 400;\"> and<\/span> <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web.sr<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We will be very happy to hear from you: let us know what you like and dislike about the CLASSLA web corpora via <a href=\"mailto:helpdesk.classla@clarin.si\">helpdesk.classla@clarin.si<\/a>. If you are interested in South Slavic resources and technologies, we also invite you to join the<\/span> <a href=\"https:\/\/mailman.ijs.si\/mailman\/listinfo\/classla\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA mailing list<\/span><\/a><span style=\"font-weight: 400;\"> and to follow the<\/span> <a href=\"https:\/\/twitter.com\/ClarinSlovenia\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI infrastructure on Twitter<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<div id=\"themify_builder_content-6349\" data-postid=\"6349\" class=\"themify_builder_content themify_builder_content-6349 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>A tutorial on how to use a CLARIN.SI tool for easy querying and statistical analysis of text collections, especially appropriate for linguists, language teachers, digital humanists, corpus linguists and others. 
We show you how you can query new massive web text collections for Croatian, Slovenian and Serbian and find collocations, word statistics, context of non-standard [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-6349","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6349","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=6349"}],"version-history":[{"count":15,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6349\/revisions"}],"predecessor-version":[{"id":6386,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/6349\/revisions\/6386"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=6349"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}