{"id":9010,"date":"2026-05-28T08:11:31","date_gmt":"2026-05-28T08:11:31","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=9010"},"modified":"2026-06-17T08:23:53","modified_gmt":"2026-06-17T08:23:53","slug":"classla-web_guide","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/","title":{"rendered":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web"},"content":{"rendered":"<p><span style=\"font-size: 16px;\">The CLASSLA-web corpora are a <strong>collection of text from the internet<\/strong>, its newest version covering <strong>seven South Slavic languages<\/strong>:<\/span><\/p>\n<ul>\n<li>Slovenian<\/li>\n<li>Croatian<\/li>\n<li>Bosnian<\/li>\n<li>Montenegrin<\/li>\n<li>Serbian<\/li>\n<li>Macedonian<\/li>\n<li>Bulgarian<\/li>\n<\/ul>\n<p>The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and 38 million documents. It currently represents the largest general corpus available for each of the seven South Slavic languages. In the case of Macedonian, it is also the first linguistically annotated general corpus for the language.<\/p>\n<p>The corpora were developed by the CLASSLA Knowledge Centre for South Slavic languages, which is officially operated by <a href=\"https:\/\/www.clarin.si\/\">CLARIN.SI<\/a>. They are built on text data collected by crawling primarily the national top-level internet domains (e.g. 
&#8220;.si&#8221; for Slovenian, &#8220;.hr&#8221; for Croatian, &#8220;.rs&#8221; for Serbian), capturing everything from news articles and legal documents to personal blogs and online forums.<\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9045\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types.png\" alt=\"Media crawling for Classla web\" width=\"1536\" height=\"1024\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types.png 1536w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types-300x200.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types-1024x683.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Classla-web_text_types-768x512.png 768w\" sizes=\"auto, (max-width: 1536px) 100vw, 1536px\" \/><\/a><\/p>\n<p>Additionally, each corpus is enriched with multiple layers of information, including linguistic annotations, genre metadata and topic classifications.<\/p>\n<div style=\"background-color: #e8f4fd; border-left: 4px solid #2980b9; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>Annotation layers of CLASSLA-web<\/strong><\/p>\n<ul>\n<li><strong>Linguistic annotation<\/strong> \u2013 tokenization, lemmatization and morphosyntactic tagging<\/li>\n<li><strong>Genre<\/strong> \u2013 e.g. News, Forum, Legal, Promotion, Opinion<\/li>\n<li><strong>Topic<\/strong> \u2013 e.g. 
Sport, Environment, Politics, Health (version 2.0 only)<\/li>\n<\/ul>\n<\/div>\n<p>These annotation layers make the corpora well suited to studying non-standard language, domain-specific vocabulary, genre variation and cross-linguistic comparisons across South Slavic languages. They are of particular interest to researchers, students and enthusiasts in linguistics, digital humanities, corpus linguistics, language teaching and related fields.<\/p>\n<p>Follow this tutorial to learn how to explore the CLASSLA-web corpora in NoSketchEngine.<\/p>\n<h1>1. Access CLASSLA-web<\/h1>\n<h2>1.1 Register on NoSketchEngine (optional for basic use)<\/h2>\n<p>The CLASSLA-web corpora are freely accessible through the CLARIN.SI NoSketchEngine concordancer, which provides a variety of analysis tools. For basic corpus browsing and querying, you can use the standard NoSketchEngine interface without registering. However, some advanced features used later in this tutorial, such as creating your own subcorpora, require a NoSketchEngine Log (skelog) account. To use these features, create a free account at <a href=\"https:\/\/www.clarin.si\/skelog\/#unauthorized\">https:\/\/www.clarin.si\/skelog\/#unauthorized<\/a> (make sure to enable pop-up windows in your browser). Registration is straightforward:<\/p>\n<div style=\"background-color: #fef9e7; border-left: 4px solid #f39c12; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\"><strong>\u26a0\ufe0f Important<\/strong><br \/>\nYour credentials cannot be retrieved if forgotten. 
Make sure to store them safely.<\/div>\n<div style=\"background-color: #e8f4fd; border-left: 4px solid #2980b9; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\"><p><strong>Registration steps<\/strong><\/p>\n<ul>\n<li>Click <strong>&#8220;Sign up now&#8221;<\/strong>, define your username and password and confirm by clicking <strong>&#8220;Register&#8221;<\/strong>.<\/li>\n<li>There will be no confirmation message.<\/li>\n<li>Return to <a href=\"https:\/\/www.clarin.si\/skelog\">https:\/\/www.clarin.si\/skelog<\/a> (again with pop-ups enabled) and log in with the credentials you just created.<\/li>\n<li>If you run into any problems during registration or login, contact <a href=\"mailto:info@clarin.si\">info@clarin.si<\/a>.<\/li>\n<\/ul>\n<\/div>\n<p>The NoSketchEngine concordancer offers tools for analysing web language that can serve as the basis for many use cases. The following tutorial explores the most important tools and offers hands-on experience for anyone new to corpus linguistics with little or no background in programming.<\/p>\n<h2>1.2 Select a corpus<\/h2>\n<p>Once you are logged in, you will land on the NoSketchEngine dashboard (Fig. 1).<\/p>\n<figure id=\"attachment_9013\" aria-describedby=\"caption-attachment-9013\" style=\"width: 687px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-6.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-9013 size-full\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-6.png\" alt=\"Dashboard of NoSketchEngine\" width=\"687\" height=\"405\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-6.png 687w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-6-300x177.png 300w\" sizes=\"auto, (max-width: 687px) 100vw, 687px\" \/><\/a><figcaption id=\"caption-attachment-9013\" class=\"wp-caption-text\">Fig. 
1 &#8211; Dashboard<\/figcaption><\/figure>\n<p>Choose the corpus you want to work with by clicking on the search box at the top of the page and either scrolling through the list or typing &#8220;CLASSLA-web.&#8221; followed by the language code. The available CLASSLA-web corpora are:<\/p>\n<ul>\n<li><code>CLASSLA-web.sl<\/code> \u2013 Slovenian<\/li>\n<li><code>CLASSLA-web.hr<\/code> \u2013 Croatian<\/li>\n<li><code>CLASSLA-web.bs<\/code> \u2013 Bosnian<\/li>\n<li><code>CLASSLA-web.cnr<\/code> \u2013 Montenegrin<\/li>\n<li><code>CLASSLA-web.sr<\/code> \u2013 Serbian<\/li>\n<li><code>CLASSLA-web.mk<\/code> \u2013 Macedonian<\/li>\n<li><code>CLASSLA-web.bg<\/code> \u2013 Bulgarian<\/li>\n<\/ul>\n<p>The CLASSLA-web corpora are currently available in two versions, which differ mainly in the period during which the web was crawled:<\/p>\n<ul>\n<li><strong>Version 1.0:<\/strong> 2021\u20132022 (11 bn. words, 26M texts)<\/li>\n<li><strong>Version 2.0:<\/strong> 2024 (17 bn. words, 38M texts)<\/li>\n<\/ul>\n<div style=\"background-color: #e9f7ef; border-left: 4px solid #27ae60; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>\ud83d\udca1 Tip<\/strong><\/p>\n<p>The two versions share only about 20% of their texts, so they can also be combined into an even larger dataset if needed.<\/p>\n<\/div>\n<p>For this tutorial, choose the Slovenian <strong>CLASSLA-web.sl 2.0<\/strong> corpus.<\/p>\n<h2>1.3 Understanding the data<\/h2>\n<p>Once you have selected your corpus, navigate to <strong>&#8220;Corpus Info&#8221;<\/strong> in the left sidebar to explore the structure of the dataset (Fig. 
2).<\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-9017\" style=\"width: 100%; height: auto;\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905-1024x321.png\" alt=\"Text Types Corpus Info\" width=\"3000\" height=\"941\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905-1024x321.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905-300x94.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905-768x241.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905-1536x482.png 1536w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-094905.png 1884w\" sizes=\"auto, (max-width: 3000px) 100vw, 3000px\" \/><\/a> Fig. 2 &#8211; Corpus Info with Text Types<\/p>\n<p>Here you get an overview of the size of the corpus, as well as a list of &#8220;Text types&#8221;, the annotated metadata categories you can filter on. Each corpus comes with basic metadata, including the <strong>web domain and URL<\/strong> of the source text (e.g. &#8220;rtvslo.si&#8221; for texts from the Slovene national broadcaster) and <strong>script<\/strong> (Latin or Cyrillic) for the Serbian, Bosnian and Montenegrin corpora. 
Beyond this, each corpus is enriched with three additional annotation layers:<\/p>\n<ul>\n<li><strong>Linguistic annotation<\/strong> \u2013 via the CLASSLA-Stanza pipeline, providing tokenization, lemmatization and morphosyntactic tagging<\/li>\n<li><strong>Genre metadata<\/strong> \u2013 via the multilingual X-GENRE classifier, covering nine categories such as News, Promotion, Opinion\/Argumentation, Forum and Legal<\/li>\n<li><strong>Topic metadata<\/strong> \u2013 via a multilingual IPTC news topic classifier, covering 23 labels such as Politics, Health, Crime\/Law and Justice, Science and Technology, Sport and Environment (available in version 2.0 only)<\/li>\n<\/ul>\n<p>Because all seven corpora were collected and processed using the same tools, within the same time frame, and annotated with the same classification systems, <strong>findings across languages can be directly compared<\/strong> without methodological inconsistencies.<\/p>\n<p>To get a feel for the data, click on <strong>&#8220;Text type analysis&#8221;<\/strong> (Fig. 
3), which gives you a general distribution overview of the metadata values.<\/p>\n<figure id=\"attachment_9019\" aria-describedby=\"caption-attachment-9019\" style=\"width: 1148px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9019\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758.png\" alt=\"Text Type Analysis - Distribution of domains\" width=\"1148\" height=\"997\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758.png 1148w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758-300x261.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758-1024x889.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-134758-768x667.png 768w\" sizes=\"auto, (max-width: 1148px) 100vw, 1148px\" \/><\/a><figcaption id=\"caption-attachment-9019\" class=\"wp-caption-text\">Fig. 3 Text Type Analysis<\/figcaption><\/figure>\n<p>In Fig. 
4, for example, you can see the topic distribution in the subcorpus &#8220;RTV Slovenija&#8221;, created by filtering for the domain rtvslo.si.<\/p>\n<figure id=\"attachment_9021\" aria-describedby=\"caption-attachment-9021\" style=\"width: 1145px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9021\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5.png\" alt=\"Distribution of topics\" width=\"1145\" height=\"810\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5.png 1145w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5-300x212.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5-1024x724.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-5-768x543.png 768w\" sizes=\"auto, (max-width: 1145px) 100vw, 1145px\" \/><\/a><figcaption id=\"caption-attachment-9021\" class=\"wp-caption-text\">Fig. 4 &#8211; Distribution of topics for subcorpus &#8220;RTV Slovenija&#8221;<\/figcaption><\/figure>\n<p>For many use cases, it is a good idea to create a <strong>subcorpus<\/strong>. That is a filtered subset of the main corpus. Filtering by a specific topic, genre, or domain allows you to get more targeted results and enables meaningful comparisons. You will learn how to do this in the next section.<\/p>\n<h1>2. Filtering<\/h1>\n<p>To create a subcorpus, click on <strong>&#8220;Manage corpus&#8221;<\/strong> on the main Dashboard (Fig. 1), then on <strong>&#8220;Subcorpora&#8221;<\/strong> (Fig. 5) and <strong>&#8220;Create Subcorpus&#8221;<\/strong> (Fig. 
6).<\/p>\n<figure id=\"attachment_9022\" aria-describedby=\"caption-attachment-9022\" style=\"width: 1147px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9022\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4.png\" alt=\"Manage corpus - Subcorpora\" width=\"1147\" height=\"373\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4.png 1147w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4-300x98.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4-1024x333.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-4-768x250.png 768w\" sizes=\"auto, (max-width: 1147px) 100vw, 1147px\" \/><\/a><figcaption id=\"caption-attachment-9022\" class=\"wp-caption-text\">Fig. 5 &#8211; Manage corpus<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_9023\" aria-describedby=\"caption-attachment-9023\" style=\"width: 1151px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9023\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3.png\" alt=\"Create Subcorpus\" width=\"1151\" height=\"265\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3.png 1151w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3-300x69.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3-1024x236.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-3-768x177.png 768w\" sizes=\"auto, (max-width: 1151px) 100vw, 1151px\" \/><\/a><figcaption 
id=\"caption-attachment-9023\" class=\"wp-caption-text\">Fig. 6 &#8211; Create Subcorpus<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>This takes you to an interface where you can choose your filters and give your subcorpus a name.<\/p>\n<p>Depending on your interest, you can select one or several criteria, including topic (e.g. Health, Politics), genre (e.g. News, Forum), or domain name (e.g. 24ur.com). Combining multiple filters is possible and allows for more targeted analyses.<\/p>\n<div style=\"background-color: #e9f7ef; border-left: 4px solid #27ae60; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>Tip<\/strong><\/p>\n<p>Keep in mind that genres are <strong>not<\/strong> evenly distributed across the CLASSLA-web corpora. News is by far the dominant genre in most of them.<\/p>\n<\/div>\n<p>For this tutorial, we will create two subcorpora:<\/p>\n<p><strong>(1) Mladina.si News<\/strong> (Fig. 7):<\/p>\n<ul>\n<li>Genre = News<\/li>\n<li>Domain = mladina.si<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" style=\"margin: 10px auto; display: block;\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-2.png\" alt=\"Mladina.si News\" width=\"1161\" height=\"476\" \/><\/p>\n<p style=\"font-size: 0.9em; color: #666; text-align: center;\">Fig. 7 \u2013 Mladina.si News filter<\/p>\n<p>&nbsp;<\/p>\n<p><strong>(2) Dru\u017eina.si News<\/strong> (Fig. 
8):<\/p>\n<ul>\n<li>Genre = News<\/li>\n<li>Domain = druzina.si<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_9025\" aria-describedby=\"caption-attachment-9025\" style=\"width: 1152px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9025\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1.png\" alt=\"Dru\u017eina.si News \" width=\"1152\" height=\"502\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1.png 1152w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1-300x131.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1-1024x446.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1-768x335.png 768w\" sizes=\"auto, (max-width: 1152px) 100vw, 1152px\" \/><\/a><figcaption id=\"caption-attachment-9025\" class=\"wp-caption-text\">Fig. 8 &#8211; Dru\u017eina.si News filter<\/figcaption><\/figure>\n<p>Each subcorpus is restricted to the News genre: one contains texts from Mladina.si, a left-liberal news outlet, and the other texts from Dru\u017eina.si, a Catholic weekly. Filtering by News ensures that we are comparing within the same genre, rather than mixing editorial content with promotional or forum content. The two subcorpora are similar in size (Fig. 
9).<\/p>\n<figure id=\"attachment_9028\" aria-describedby=\"caption-attachment-9028\" style=\"width: 1148px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9028\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702.png\" alt=\"Subcorpora overview\" width=\"1148\" height=\"414\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702.png 1148w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702-300x108.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702-1024x369.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-04-03-140702-768x277.png 768w\" sizes=\"auto, (max-width: 1148px) 100vw, 1148px\" \/><\/a><figcaption id=\"caption-attachment-9028\" class=\"wp-caption-text\">Fig. 9 &#8211; Subcorpora overview<\/figcaption><\/figure>\n<p>Using the two subcorpora that we just created, we will investigate the following question:<\/p>\n<div style=\"background-color: #f4f0f9; border-left: 4px solid #8e44ad; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>Research question<\/strong><\/p>\n<p><em>What differences in vocabulary can be observed between the left-leaning news outlet Mladina (mladina.si) and the Catholic weekly Dru\u017eina (druzina.si)? What do the most distinctive keywords and most frequent words reveal about how each source covers the news?<\/em><\/p>\n<\/div>\n<h1>3. Keywords<\/h1>\n<p>A word is considered a <strong>keyword<\/strong> when it is statistically overrepresented in a focus corpus compared to a reference corpus, meaning it appears more often than we would expect by chance. 
NoSketchEngine measures this using <strong>log-likelihood<\/strong>, a statistical metric that indicates how significant the difference in frequency is between the two corpora. Rather than simply comparing raw frequencies, log-likelihood takes into account the size of both corpora and asks: <em>how probable is it that this difference in frequency occurred by chance alone?<\/em> The higher the score, the more confident we can be that a word genuinely characterises the focus corpus.<\/p>\n<p>The Keywords tool (accessible from the main menu or from the sidebar, Fig. 10) compares different corpora to find words that appear unusually frequently in one compared to the other(s), in other words, words that are specific to the focus corpus.<\/p>\n<figure id=\"attachment_9029\" aria-describedby=\"caption-attachment-9029\" style=\"width: 1154px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9029\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design.png\" alt=\"Sidebar Keyword tool\" width=\"1154\" height=\"565\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design.png 1154w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-300x147.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-1024x501.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-768x376.png 768w\" sizes=\"auto, (max-width: 1154px) 100vw, 1154px\" \/><\/a><figcaption id=\"caption-attachment-9029\" class=\"wp-caption-text\">Fig. 10 &#8211; Sidebar<\/figcaption><\/figure>\n<p>To find keywords characteristic of the Mladina News subcorpus, compared to both the full CLASSLA-web.sl 2.0 corpus and the Dru\u017eina News subcorpus, select the options shown in Fig. 
11.<\/p>\n<figure id=\"attachment_9030\" aria-describedby=\"caption-attachment-9030\" style=\"width: 952px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-8.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9030\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-8.png\" alt=\"Keyword settings\" width=\"952\" height=\"894\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-8.png 952w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-8-300x282.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-8-768x721.png 768w\" sizes=\"auto, (max-width: 952px) 100vw, 952px\" \/><\/a><figcaption id=\"caption-attachment-9030\" class=\"wp-caption-text\">Fig. 11 &#8211; Keyword settings<\/figcaption><\/figure>\n<p>This search will generate a list of 1,000 key-lemmas (Fig. 12).<\/p>\n<figure id=\"attachment_9033\" aria-describedby=\"caption-attachment-9033\" style=\"width: 1881px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9033\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9.png\" alt=\"Mladina.si keywords\" width=\"1881\" height=\"659\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9.png 1881w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9-300x105.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9-1024x359.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9-768x269.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-9-1536x538.png 1536w\" sizes=\"auto, (max-width: 1881px) 
100vw, 1881px\" \/><\/a><figcaption id=\"caption-attachment-9033\" class=\"wp-caption-text\">Fig. 12 &#8211; Mladina.si keywords<\/figcaption><\/figure>\n<p>This keyword list is dominated by vocabulary about media, finance and politics. It also contains several proper nouns, some pointing to specific scandals, others referring to well-known individuals or news magazines. These thematic clusters suggest that Mladina is strongly oriented towards political, economic and media topics.<\/p>\n<div style=\"background-color: #f4f0f9; border-left: 4px solid #8e44ad; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>Insight into Mladina.si&#8217;s Keyword Glossary<\/strong><\/p>\n<p><strong>Media<\/strong><\/p>\n<ul>\n<li><code>\u010dlanek<\/code> \u2013 article<\/li>\n<li><code>guardian<\/code> \u2013 The Guardian (British newspaper)<\/li>\n<li><code>spiegel<\/code> \u2013 Der Spiegel (German news magazine)<\/li>\n<\/ul>\n<p><strong>Finance<\/strong><\/p>\n<ul>\n<li><code>delnica<\/code> \u2013 share \/ stock<\/li>\n<li><code>ecb<\/code> \u2013 ECB (European Central Bank)<\/li>\n<li><code>obveznica<\/code> \u2013 bond (financial)<\/li>\n<li><code>privatizacija<\/code> \u2013 privatisation<\/li>\n<\/ul>\n<p><strong>Politics<\/strong><\/p>\n<ul>\n<li><code>jan\u0161ev<\/code> \u2013 relating to Janez Jan\u0161a (Slovenian politician)<\/li>\n<li><code>\u017evi\u017ega\u010d<\/code> \u2013 whistleblower<\/li>\n<li><code>neoliberalen<\/code> \u2013 neoliberal<\/li>\n<li><code>protikorupcijski<\/code> \u2013 anti-corruption<\/li>\n<\/ul>\n<p><strong>Proper nouns<\/strong><\/p>\n<ul>\n<li><code>TE\u0160<\/code> \u2013 \u0160o\u0161tanj Thermal Power Plant (major financial scandal)<\/li>\n<li><code>Snowden<\/code> \u2013 Edward Snowden (NSA whistleblower)<\/li>\n<li><code>Spiegel<\/code> \u2013 Der Spiegel (German investigative news magazine)<\/li>\n<\/ul>\n<\/div>\n<p>To explore any lemma further, click the three dots next to it and select 
<strong>&#8220;Concordance (focus corpus)&#8221;<\/strong> (Fig. 13).<\/p>\n<figure id=\"attachment_9037\" aria-describedby=\"caption-attachment-9037\" style=\"width: 1890px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9037\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11.png\" alt=\"Concordance\" width=\"1890\" height=\"675\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11.png 1890w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11-300x107.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11-1024x366.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11-768x274.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-11-1536x549.png 1536w\" sizes=\"auto, (max-width: 1890px) 100vw, 1890px\" \/><\/a><figcaption id=\"caption-attachment-9037\" class=\"wp-caption-text\">Fig. 13 &#8211; Option to see concordances for selected word<\/figcaption><\/figure>\n<p>This displays every occurrence of the selected term in the corpus alongside its surrounding context, the source of each instance and the total frequency count (see Fig. 
14).<\/p>\n<figure id=\"attachment_9038\" aria-describedby=\"caption-attachment-9038\" style=\"width: 1894px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9038\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356.png\" alt=\"Concordances example\" width=\"1894\" height=\"812\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356.png 1894w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356-300x129.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356-1024x439.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356-768x329.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-19-111356-1536x659.png 1536w\" sizes=\"auto, (max-width: 1894px) 100vw, 1894px\" \/><\/a><figcaption id=\"caption-attachment-9038\" class=\"wp-caption-text\">Fig. 14 &#8211; Concordances example<\/figcaption><\/figure>\n<p>If we swap the subcorpora so that Dru\u017eina News becomes the focus corpus, the resulting keyword list looks very different (Fig. 15). These 1,000 keywords are again lemmas, but they almost exclusively represent religious vocabulary, specifically Catholic terms and practices (Vatican, jubilant, \u017eupnijski). 
Financial and political vocabulary is almost completely absent.<\/p>\n<figure id=\"attachment_9039\" aria-describedby=\"caption-attachment-9039\" style=\"width: 1910px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-9039\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12.png\" alt=\"Dru\u017eina .si keywords\" width=\"1910\" height=\"740\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12.png 1910w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12-300x116.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12-1024x397.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12-768x298.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2026\/05\/Untitled-design-12-1536x595.png 1536w\" sizes=\"auto, (max-width: 1910px) 100vw, 1910px\" \/><\/a><figcaption id=\"caption-attachment-9039\" class=\"wp-caption-text\">Fig. 
15 &#8211; Dru\u017eina.si keywords<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<div style=\"background-color: #f4f0f9; border-left: 4px solid #8e44ad; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>Insight into Dru\u017eina.si&#8217;s Keyword Glossary<\/strong><\/p>\n<ul>\n<li><code>bogoslu\u017eje<\/code> \u2013 liturgy<\/li>\n<li><code>zakrament<\/code> \u2013 sacrament<\/li>\n<li><code>oltar<\/code> \u2013 altar<\/li>\n<li><code>blagoslov<\/code> \u2013 blessing<\/li>\n<li><code>\u010de\u0161\u010denje<\/code> \u2013 veneration<\/li>\n<\/ul>\n<p><strong>Church structure &amp; community<\/strong><\/p>\n<ul>\n<li><code>\u017eupnija<\/code> \u2013 parish<\/li>\n<li><code>dekanija<\/code> \u2013 deanery<\/li>\n<li><code>diakon<\/code> \u2013 deacon<\/li>\n<li><code>sinoda<\/code> \u2013 synod<\/li>\n<li><code>bogoslovec<\/code> \u2013 theology student<\/li>\n<\/ul>\n<p><strong>Mission &amp; pilgrimage<\/strong><\/p>\n<ul>\n<li><code>misijon<\/code> \u2013 mission<\/li>\n<li><code>misijonar<\/code> \u2013 missionary<\/li>\n<li><code>romanje<\/code> \u2013 pilgrimage<\/li>\n<li><code>romar<\/code> \u2013 pilgrim<\/li>\n<\/ul>\n<p><strong>Proper nouns<\/strong><\/p>\n<ul>\n<li><code>fran\u010di\u0161ek<\/code> \u2013 Francis (Pope Francis)<\/li>\n<li><code>parolin<\/code> \u2013 Parolin (Vatican Secretary of State)<\/li>\n<li><code>kathpress<\/code> \u2013 KathPress (Austrian Catholic news agency)<\/li>\n<\/ul>\n<\/div>\n<p>With just a few clicks, we have established that mladina.si and druzina.si occupy very different thematic spaces: Mladina is dominated by political and economic debate, while Dru\u017eina is focused on religion and Catholic practice.<\/p>\n<div style=\"background-color: #f0f0f0; border-left: 4px solid #333333; padding: 15px 20px; margin: 20px 0; border-radius: 4px;\">\n<p><strong>\ud83d\udd0d Keywords &amp; Regex<\/strong><\/p>\n<p>To narrow down keyword results by part of speech, change the attribute in the keyword settings 
from <em>Lemma (base form)<\/em> to <em>Lemma with PoS tag<\/em> and apply one of the following regex patterns:<\/p>\n<ul>\n<li><code>.*-v<\/code> (verbs only)<\/li>\n<li><code>.*-n<\/code> (nouns only)<\/li>\n<li><code>.*-a<\/code> (adjectives only)<\/li>\n<\/ul>\n<p>Verbs can be particularly revealing because they are words of action: they tell us what is happening in the texts rather than just which topics are mentioned. Filtering by part of speech also helps reduce noise in the keyword list, which can otherwise be dominated by proper nouns or function words. By looking at verbs, nouns and adjectives separately, we get a more focused and interpretable picture of what makes a corpus distinctive.<\/p>\n<\/div>\n<h1>4. Conclusion<\/h1>\n<p>This tutorial has walked you through the core steps of corpus analysis using the CLASSLA-web corpus and NoSketchEngine:<\/p>\n<ul>\n<li>Setting up subcorpora<\/li>\n<li>Generating keyword lists<\/li>\n<li>Interpreting frequency patterns<\/li>\n<\/ul>\n<p>The comparison between Mladina and Dru\u017eina has shown how just a handful of tools can quickly reveal meaningful differences in vocabulary and thematic focus between two news websites. These findings are only a starting point: the same workflow can be applied to any combination of subcorpora, languages or genres available in CLASSLA-web. You can also adapt it to a variety of research questions, from political discourse to genre variation or cross-linguistic comparison. 
There is plenty more to explore!<\/p>\n<div class=\"mceTemp\"><\/div>\n","protected":false},"excerpt":{"rendered":"<p>The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"class_list":["post-9010","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"aioseo_notices":[],"aioseo_head":"\n\t\t<!-- All in One SEO 4.9.8 - aioseo.com -->\n\t<meta name=\"description\" content=\"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and\" \/>\n\t<meta name=\"robots\" content=\"max-image-preview:large\" \/>\n\t<meta name=\"google-site-verification\" content=\"LiA10aq97L10baWhrk27m-8KV46nP_6qo6Z8pFmPF88\" \/>\n\t<link rel=\"canonical\" href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/\" \/>\n\t<meta name=\"generator\" content=\"All in One SEO (AIOSEO) 4.9.8\" \/>\n\t\t<meta property=\"og:locale\" content=\"en_GB\" \/>\n\t\t<meta property=\"og:site_name\" content=\"CLARIN Slovenija 
- Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije\" \/>\n\t\t<meta property=\"og:type\" content=\"article\" \/>\n\t\t<meta property=\"og:title\" content=\"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija\" \/>\n\t\t<meta property=\"og:description\" content=\"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and\" \/>\n\t\t<meta property=\"og:url\" content=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/\" \/>\n\t\t<meta property=\"article:published_time\" content=\"2026-05-28T08:11:31+00:00\" \/>\n\t\t<meta property=\"article:modified_time\" content=\"2026-06-17T08:23:53+00:00\" \/>\n\t\t<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n\t\t<meta name=\"twitter:title\" content=\"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija\" \/>\n\t\t<meta name=\"twitter:description\" content=\"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and\" \/>\n\t\t<script type=\"application\/ld+json\" 
class=\"aioseo-schema\">\n\t\t\t{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#breadcrumblist\",\"itemListElement\":[{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info#listItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.clarin.si\\\/info\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/#listItem\",\"name\":\"CLASSLA: Knowledge centre for South Slavic languages\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/#listItem\",\"position\":2,\"name\":\"CLASSLA: Knowledge centre for South Slavic languages\",\"item\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/\",\"nextItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#listItem\",\"name\":\"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web\"},\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info#listItem\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#listItem\",\"position\":3,\"name\":\"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web\",\"previousItem\":{\"@type\":\"ListItem\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/#listItem\",\"name\":\"CLASSLA: Knowledge centre for South Slavic languages\"}}]},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#organization\",\"name\":\"CLARIN Slovenija\",\"description\":\"Slovenska raziskovalna infrastruktura za jezikovne vire in 
tehnologije\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/wp-content\\\/uploads\\\/2014\\\/08\\\/Clarin-SI-logo.png\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#organizationLogo\",\"width\":359,\"height\":150},\"image\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#organizationLogo\"}},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#webpage\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/\",\"name\":\"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija\",\"description\":\"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and\",\"inLanguage\":\"en-GB\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#website\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/k-centre\\\/classla-web_guide\\\/#breadcrumblist\"},\"datePublished\":\"2026-05-28T08:11:31+00:00\",\"dateModified\":\"2026-06-17T08:23:53+00:00\"},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#website\",\"url\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/\",\"name\":\"CLARIN Slovenija\",\"description\":\"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije\",\"inLanguage\":\"en-GB\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.clarin.si\\\/info\\\/#organization\"}}]}\n\t\t<\/script>\n\t\t<!-- All in One SEO -->\n\n","aioseo_head_json":{"title":"Exploring Internet Language: A Step-by-Step Guide to Corpus 
Analysis with CLASSLA-web - CLARIN Slovenija","description":"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and","canonical_url":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/","robots":"max-image-preview:large","keywords":"","webmasterTools":{"google-site-verification":"LiA10aq97L10baWhrk27m-8KV46nP_6qo6Z8pFmPF88","miscellaneous":""},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"BreadcrumbList","@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#breadcrumblist","itemListElement":[{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info#listItem","position":1,"name":"Home","item":"https:\/\/www.clarin.si\/info","nextItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/k-centre\/#listItem","name":"CLASSLA: Knowledge centre for South Slavic languages"}},{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/k-centre\/#listItem","position":2,"name":"CLASSLA: Knowledge centre for South Slavic languages","item":"https:\/\/www.clarin.si\/info\/k-centre\/","nextItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#listItem","name":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web"},"previousItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info#listItem","name":"Home"}},{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#listItem","position":3,"name":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web","previousItem":{"@type":"ListItem","@id":"https:\/\/www.clarin.si\/info\/k-centre\/#listItem","name":"CLASSLA: Knowledge centre 
for South Slavic languages"}}]},{"@type":"Organization","@id":"https:\/\/www.clarin.si\/info\/#organization","name":"CLARIN Slovenija","description":"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije","url":"https:\/\/www.clarin.si\/info\/","logo":{"@type":"ImageObject","url":"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2014\/08\/Clarin-SI-logo.png","@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#organizationLogo","width":359,"height":150},"image":{"@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#organizationLogo"}},{"@type":"WebPage","@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#webpage","url":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/","name":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija","description":"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and","inLanguage":"en-GB","isPartOf":{"@id":"https:\/\/www.clarin.si\/info\/#website"},"breadcrumb":{"@id":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/#breadcrumblist"},"datePublished":"2026-05-28T08:11:31+00:00","dateModified":"2026-06-17T08:23:53+00:00"},{"@type":"WebSite","@id":"https:\/\/www.clarin.si\/info\/#website","url":"https:\/\/www.clarin.si\/info\/","name":"CLARIN Slovenija","description":"Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije","inLanguage":"en-GB","publisher":{"@id":"https:\/\/www.clarin.si\/info\/#organization"}}]},"og:locale":"en_GB","og:site_name":"CLARIN Slovenija - Slovenska raziskovalna infrastruktura za jezikovne vire in 
tehnologije","og:type":"article","og:title":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija","og:description":"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words and","og:url":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/","article:published_time":"2026-05-28T08:11:31+00:00","article:modified_time":"2026-06-17T08:23:53+00:00","twitter:card":"summary_large_image","twitter:title":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web - CLARIN Slovenija","twitter:description":"The CLASSLA-web corpora are a collection of text from the internet, its newest version covering seven South Slavic languages: Slovenian Croatian Bosnian Montenegrin Serbian Macedonian Bulgarian The second version of the CLASSLA-web corpus collection substantially expands the original release, growing from 13 billion tokens and 26 million documents to more than 17 billion words 
and"},"aioseo_meta_data":{"post_id":"9010","title":null,"description":null,"keywords":null,"keyphrases":{"focus":{"keyphrase":"","score":0,"analysis":{"keyphraseInTitle":{"score":0,"maxScore":9,"error":1}}},"additional":[]},"primary_term":null,"canonical_url":null,"og_title":null,"og_description":null,"og_object_type":"default","og_image_type":"default","og_image_custom_url":null,"og_image_custom_fields":null,"og_image_url":null,"og_image_width":null,"og_image_height":null,"og_video":"","og_custom_url":null,"og_article_section":null,"og_article_tags":null,"twitter_use_og":false,"twitter_card":"default","twitter_image_type":"default","twitter_image_custom_url":null,"twitter_image_custom_fields":null,"twitter_image_url":null,"twitter_title":null,"twitter_description":null,"schema_type":"default","schema_type_options":null,"schema":{"blockGraphs":[],"customGraphs":[],"default":{"data":{"Article":[],"Course":[],"Dataset":[],"FAQPage":[],"Movie":[],"Person":[],"Product":[],"ProductReview":[],"Car":[],"Recipe":[],"Service":[],"SoftwareApplication":[],"WebPage":[]},"graphName":"WebPage","isEnabled":true},"graphs":[]},"pillar_content":false,"robots_default":true,"robots_noindex":false,"robots_noarchive":false,"robots_nosnippet":false,"robots_nofollow":false,"robots_noimageindex":false,"robots_noodp":false,"robots_notranslate":false,"robots_max_snippet":"-1","robots_max_videopreview":"-1","robots_max_imagepreview":"large","priority":null,"frequency":"default","local_seo":null,"limit_modified_date":false,"ai":{"faqs":[],"keyPoints":[],"schemas":[],"titles":[],"descriptions":[],"socialPosts":{"email":[],"linkedin":[],"twitter":[],"facebook":[],"instagram":[]}},"breadcrumb_settings":null,"seo_analyzer_scan_date":null,"created":"2026-05-28 07:45:35","updated":"2026-06-17 08:43:52"},"aioseo_breadcrumb":"<div class=\"aioseo-breadcrumbs\"><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.clarin.si\/info\" title=\"Home\">Home<\/a>\n\t\t<\/span><span 
class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\t<a href=\"https:\/\/www.clarin.si\/info\/k-centre\/\" title=\"CLASSLA: Knowledge centre for South Slavic languages\">CLASSLA: Knowledge centre for South Slavic languages<\/a>\n\t\t<\/span><span class=\"aioseo-breadcrumb-separator\">&raquo;<\/span><span class=\"aioseo-breadcrumb\">\n\t\t\tExploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web\n\t\t<\/span><\/div>","aioseo_breadcrumb_json":[{"label":"Home","link":"https:\/\/www.clarin.si\/info"},{"label":"CLASSLA: Knowledge centre for South Slavic languages","link":"https:\/\/www.clarin.si\/info\/k-centre\/"},{"label":"Exploring Internet Language: A Step-by-Step Guide to Corpus Analysis with CLASSLA-web","link":"https:\/\/www.clarin.si\/info\/k-centre\/classla-web_guide\/"}],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/9010","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=9010"}],"version-history":[{"count":62,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/9010\/revisions"}],"predecessor-version":[{"id":9141,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/9010\/revisions\/9141"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=9010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}