{"id":3558,"date":"2019-03-20T13:34:19","date_gmt":"2019-03-20T13:34:19","guid":{"rendered":"http:\/\/www.clarin.si\/info\/?page_id=3558"},"modified":"2026-02-12T13:30:31","modified_gmt":"2026-02-12T13:30:31","slug":"k-centre","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/","title":{"rendered":"CLASSLA: Knowledge centre for South Slavic languages"},"content":{"rendered":"<p>The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers <strong>expertise on language resources and technologies for South Slavic languages<\/strong>. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. <span style=\"font-weight: 400;\">Read more about CLASSLA\u2019s activities and its mission in a Tour de CLARIN article, published <\/span><a href=\"https:\/\/www.clarin.eu\/blog\/tour-de-clarin-clarin-knowledge-centre-south-slavic-languages-classla\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>CLASSLA can be contacted via <strong><a href=\"mailto:helpdesk.classla@clarin.si\">helpdesk.classla@clarin.si<\/a><\/strong>. 
The <strong>helpdesk<\/strong> offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.<\/p>\n<p>The Knowledge Centre currently offers <strong>frequently asked questions (FAQ) documentation<\/strong> for the following languages: <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/faq4slovene\">Slovene<\/a>, <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/faq4croatian\">Croatian<\/a>, <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/faq4serbian\">Serbian<\/a>, <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/faq4bulgarian\/\">Bulgarian<\/a>, and <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4macedonian\/\">Macedonian<\/a>. It also offers <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">documentation<\/a> on how to use the CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.<\/p>\n<p>The most relevant announcements, discussed on <strong>our mailing list<\/strong>, are made available below. 
You can <strong><a href=\"https:\/\/mailman.ijs.si\/mailman\/listinfo\/classla\" target=\"_blank\" rel=\"noopener\">subscribe to the mailing list here<\/a>\u00a0<\/strong>to be informed of new resources, technologies, events and projects for South Slavic languages.<\/p>\n<p>Stay updated on the latest activities of the CLASSLA Knowledge Centre and the CLARIN.SI infrastructure by following:<\/p>\n<ul>\n<li>CLARIN.SI on <a href=\"https:\/\/x.com\/ClarinSlovenia\" target=\"_blank\" rel=\"noopener\">X<\/a> and <a href=\"https:\/\/www.linkedin.com\/company\/clarin-si\" target=\"_blank\" rel=\"noopener\">LinkedIn<\/a><\/li>\n<li>the <a href=\"https:\/\/discord.com\/invite\/vQDRpGMU7C\" target=\"_blank\" rel=\"noopener\">Discord group &#8220;Slovenska skupnost za jezikovne vire in tehnologije&#8221;<\/a><\/li>\n<\/ul>\n<p>You can access the new resources and technologies developed by researchers from CLASSLA and CLARIN.SI in the following repositories:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/?locale-attribute=sl\" target=\"_blank\" rel=\"noopener\">the CLARIN.SI repository<\/a><\/li>\n<li><a href=\"https:\/\/huggingface.co\/classla\" target=\"_blank\" rel=\"noopener\">HuggingFace (organisation profile CLASSLA)<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/clarinsi\" target=\"_blank\" rel=\"noopener\">GitHub<\/a><\/li>\n<\/ul>\n<p>CLASSLA is operated by <a href=\"http:\/\/www.clarin.si\">CLARIN.SI<\/a>, the <a href=\"http:\/\/ihjj.hr\/\" target=\"_blank\" rel=\"noopener\">Institute of Croatian Language<\/a>, and <a href=\"https:\/\/clada-bg.eu\" target=\"_blank\" rel=\"noopener\">CLADA-BG<\/a>.<\/p>\n<p>The CLASSLA K-Centre is part of the CLARIN ERIC K-Centres network, offering expertise in various languages, language technologies, resources, and services. 
Learn more about other K-Centres that might be relevant for you in the <a href=\"https:\/\/www.clarin.eu\/k-centre-catalogue\" target=\"_blank\" rel=\"noopener\">CLARIN K-Centre Catalogue<\/a>.<\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-7196 aligncenter\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background-300x99.png\" alt=\"\" width=\"300\" height=\"99\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background-300x99.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background-1024x336.png 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background-768x252.png 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/CLASSLA-k-centre-transparent-background.png 1419w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<h2>CLASSLA Blog Posts<\/h2>\n<ul>\n<li><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/comparable-classla-web-corpora-of-south-slavic-languages\/\" target=\"_blank\" rel=\"noopener\">Comparable CLASSLA web corpora of South Slavic languages<\/a> (December 5, 2023; 3-minute read)<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\" target=\"_blank\" rel=\"noopener\">CLASSLA-web: Bigger and Better Web Corpora for Croatian, Serbian and Slovenian on CLARIN.SI Concordancers<\/a> (June 22, 2023; 10-minute read)<\/li>\n<\/ul>\n<h2>Recent Announcements<\/h2>\n\n<h3><strong>February 12, 2026 \u2013 CLASSLA LLM Evaluation Dashboard for South Slavic Languages<\/strong><\/h3>\n<p>Are you wondering which large language model performs best on South Slavic 
languages and dialects? Or how well models perform on sentiment, topic and genre classification and on commonsense reasoning tasks, especially compared to previous state-of-the-art models that were trained specifically for these tasks?<\/p>\n<p>At CLASSLA and CLARIN.SI, we have now set up an interactive dashboard that allows you to search through the results of our evaluation of large language models on various tasks. The models are evaluated on carefully, manually annotated datasets, most of which we developed in previous projects. You are warmly invited to visit the dashboard here: <a href=\"https:\/\/www.clarin.si\/classla-llm-dashboard\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.clarin.si\/classla-llm-dashboard\/<\/a><\/p>\n<p>For more details:<br \/>\n&#8211; Read the paper on the benchmarking experiments and results <a href=\"https:\/\/arxiv.org\/abs\/2511.07989\" target=\"_blank\" rel=\"noopener\">&#8220;State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?&#8221;<\/a> (Taja Kuzman Punger\u0161ek, Peter Rupnik, Ivan Porupski, Vuk Dini\u0107, Nikola Ljube\u0161i\u0107, 2025)<br \/>\n&#8211; Consult the <a href=\"https:\/\/github.com\/TajaKuzman\/Benchmarking-Text-Classification-on-South-Slavic\/\" target=\"_blank\" rel=\"noopener\">code for the experiments published on GitHub<\/a><\/p>\n<h3><strong>December 29, 2025 \u2013 CLASSLA Annual Recap: 2025 in Review<\/strong><\/h3>\n<p dir=\"ltr\">As we wrap up another eventful year, we would like to share an overview of the key developments and activities at the CLASSLA Knowledge Centre for South Slavic Languages during 2025.<\/p>\n<p dir=\"ltr\"><b>CLASSLA-web corpora for South Slavic languages<\/b><\/p>\n<p dir=\"ltr\">We are excited to announce that we have released the second version of the CLASSLA-web corpora, comprising texts that were collected from the web in 2024. 
You can now query the new corpora on the CLARIN.SI concordancer (<a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_bs\" target=\"_blank\" rel=\"noopener\">Bosnian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_bg\" target=\"_blank\" rel=\"noopener\">Bulgarian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_hr\" target=\"_blank\" rel=\"noopener\">Croatian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_mk\" target=\"_blank\" rel=\"noopener\">Macedonian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_cnr\" target=\"_blank\" rel=\"noopener\">Montenegrin<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_sr\" target=\"_blank\" rel=\"noopener\">Serbian<\/a>, and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb2_sl\" target=\"_blank\" rel=\"noopener\">Slovenian<\/a> corpora), <a href=\"http:\/\/hdl.handle.net\/11356\/2079\" target=\"_blank\" rel=\"noopener\">download them from the CLARIN.SI repository<\/a> or find more information about both 1.0 and 2.0 versions of CLASSLA-web corpora on a new website: <a class=\"moz-txt-link-freetext\" href=\"https:\/\/clarinsi.github.io\/classla-web\/\" target=\"_blank\" rel=\"noopener\">https:\/\/clarinsi.github.io\/classla-web\/<\/a><\/p>\n<p dir=\"ltr\">Although collected from the same national domains as version 1.0 from 2021 and 2022, the new release is substantially larger and contains mostly new material: around 50% more texts and words, totalling 38 million texts and 17 billion words across seven South Slavic languages. 
The corpora are linguistically annotated with an <a href=\"https:\/\/zenodo.org\/records\/13936406\" target=\"_blank\" rel=\"noopener\">improved CLASSLA-Stanza<\/a> tool (<a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\">available as a service here<\/a>) and a multilingual genre classifier <a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\">X-GENRE<\/a>. In addition, version 2.0 now also includes topic labels based on our <a href=\"https:\/\/huggingface.co\/classla\/multilingual-IPTC-news-topic-classifier\" target=\"_blank\" rel=\"noopener\">multilingual news topic classifier<\/a>. The corpora are <a href=\"http:\/\/hdl.handle.net\/11356\/2079\" target=\"_blank\" rel=\"noopener\">available on the CLARIN.SI repository<\/a> in JSONL and linguistically-annotated VERT formats.<\/p>\n<p dir=\"ltr\"><b>CLASSLA-Express workshop series<\/b><\/p>\n<p dir=\"ltr\">Our <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\">CLASSLA-Express workshop<\/a> programme expanded both in content and geography. This year, seven workshops were held across four countries \u2013 Austria, Bulgaria, Croatia, and Slovenia \u2013 led primarily by Ivana Filipovi\u0107 Petrovi\u0107 and Jelena Parizoska, with contributions from Petya Osenova and local organizers. In addition to demonstrating the use of CLARIN.SI concordancers and the CLASSLA-web corpora, the workshops introduced new topics with a strong focus on applying modern AI methods in linguistic research. We are delighted by the continued interest and encourage you to explore the <u> <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/\" target=\"_blank\" rel=\"noopener\">detailed workshop reports<\/a><\/u> available on our website. 
You are warmly invited to stay tuned: CLASSLA-Express 3.0, with a new focus on spoken corpora, is already on the horizon.<\/p>\n<p dir=\"ltr\"><b>Benchmarking large language models for South Slavic languages and dialects<\/b><\/p>\n<p dir=\"ltr\">Evaluation of large language models (LLMs) continued to be one of our key activities. This year, we participated in the development of multiple South Slavic benchmarks for LLM evaluation, including the <a href=\"https:\/\/arxiv.org\/abs\/2510.24081\" target=\"_blank\" rel=\"noopener\">Global-PIQA<\/a> test set, a multilingual commonsense reasoning benchmark developed by 335 co-authors and covering 116 languages and dialects, including standard South Slavic languages, as well as the Torlak, Chakavian, and Slovenian Cerkno dialects.<\/p>\n<p dir=\"ltr\">In parallel, we launched an <a href=\"https:\/\/www.clarin.si\/classla-llm-dashboard\/\" target=\"_blank\" rel=\"noopener\">interactive platform presenting evaluation results for South Slavic languages and dialects across six tasks<\/a>: two commonsense reasoning benchmark families (COPA and PIQA), sentiment classification, news topic classification, and automatic genre identification. The platform enables researchers and developers to compare large language model performance, identify strengths and weaknesses, and follow developments over time. To support further experimentation and application, we provide an <u><a href=\"https:\/\/arxiv.org\/abs\/2511.07989\" target=\"_blank\" rel=\"noopener\">accompanying paper with an overview of current model performance<\/a><\/u> as well as <a href=\"https:\/\/github.com\/TajaKuzman\/Benchmarking-Text-Classification-on-South-Slavic\" target=\"_blank\" rel=\"noopener\">open-source code<\/a> for running evaluations and adapting LLMs to new tasks. 
We are excited to continue our benchmarking activities as part of the <a href=\"https:\/\/www.cjvt.si\/llm4dh\/en\/\" target=\"_blank\" rel=\"noopener\">LLM4DH<\/a> and <a href=\"https:\/\/alt-edic.eu\/projects\/llms4eu\/\" target=\"_blank\" rel=\"noopener\">LLMs4EU<\/a> projects, which will extend over the next few years.<\/p>\n<p dir=\"ltr\"><b>Speech corpora and technologies<\/b><\/p>\n<p dir=\"ltr\">Our efforts in speech resources advanced significantly this year, with a major focus on expanding and enriching parliamentary speech corpora. A key achievement was the release of <a href=\"https:\/\/clarinsi.github.io\/parlaspeech\/\" target=\"_blank\" rel=\"noopener\">ParlaSpeech 3.0<\/a>, a multilingual collection covering Croatian, Serbian, Czech, and Polish parliamentary proceedings. In the new release, ParlaSpeech has been extended with five annotation layers: linguistic annotation, sentiment labels, filled-pause detection, precise word-level alignments, and primary stress information. These enrichment layers have been added automatically with cutting-edge models for processing speech and text, most of which can be found on the <a href=\"https:\/\/huggingface.co\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA Hugging Face page<\/a>. The enrichments enable advanced studies of prosody, disfluency patterns, and multimodal aspects of parliamentary speech. 
In addition to the <a href=\"http:\/\/hdl.handle.net\/11356\/1833\" target=\"_blank\" rel=\"noopener\">CLARIN.SI repository<\/a>, the corpora are now accessible through the CLARIN.SI concordancers (<a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=parlaspeech3_hr\" target=\"_blank\" rel=\"noopener\">Croatian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=parlaspeech3_rs\" target=\"_blank\" rel=\"noopener\">Serbian<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=parlaspeech3_cz\" target=\"_blank\" rel=\"noopener\">Czech<\/a> and <a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=parlaspeech3_pl\" target=\"_blank\" rel=\"noopener\">Polish<\/a>), accompanied by a <a href=\"https:\/\/clarinsi.github.io\/parlaspeech\/concordancer\/concordancer-guide.html\" target=\"_blank\" rel=\"noopener\">tutorial on how to query them<\/a>.<\/p>\n<p dir=\"ltr\"><b>Supporting SSH researchers in working with large language models<\/b><\/p>\n<p dir=\"ltr\">As part of the newly established <a href=\"https:\/\/llms4ssh.clarin-pl.eu\/\" target=\"_blank\" rel=\"noopener\">LLMs4SSH<\/a> CLARIN Knowledge Centre, we contributed expertise to help researchers in the social sciences and humanities navigate the rapidly evolving landscape of large language models. 
Our contributions included an <a href=\"https:\/\/www.clarin.si\/info\/k-centres\/llms4ssh-clarin-k-centre-for-large-language-models-in-ssh\/\" target=\"_blank\" rel=\"noopener\">overview of Slovenian activities, technologies, and datasets related to LLM development<\/a>; a <a href=\"https:\/\/arxiv.org\/abs\/2510.24450\" target=\"_blank\" rel=\"noopener\">proposal for a new taxonomy for LLM evaluation datasets<\/a>; and a concept for a European database offering a clear map of available resources by language and evaluation task.<\/p>\n<p dir=\"ltr\"><b>Models and datasets on Hugging Face<\/b><\/p>\n<p dir=\"ltr\">This year, numerous new models and datasets were released on the <a href=\"https:\/\/huggingface.co\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA Hugging Face page<\/a>, including the first <a href=\"https:\/\/huggingface.co\/classla\/multilingual-IPTC-news-topic-classifier\" target=\"_blank\" rel=\"noopener\">openly available multilingual IPTC news topic classifier<\/a>, which has already surpassed 600,000 downloads. We are thrilled to see such strong uptake and will continue expanding the collection of openly accessible tools and corpora for South Slavic languages and beyond.<\/p>\n<p dir=\"ltr\"><b>Looking ahead<\/b><\/p>\n<p dir=\"ltr\">As we reflect on this year\u2019s achievements, we extend our sincere thanks to all team members and collaborators who have contributed to our activities, and to the users who make use of our resources. Your engagement and feedback drive our continued commitment to supporting linguistic research and technology development for South Slavic languages.<\/p>\n<p dir=\"ltr\">We look forward to another productive year filled with exciting advances and new collaborations. 
Wishing you a successful and inspiring year ahead!<\/p>\n<h3><strong>October 17, 2025 \u2013 Parliamentary ParlaCAP Dataset and CAP Topic Classifier<\/strong><\/h3>\n<p>CLARIN.SI is pleased to announce the release of the <a href=\"https:\/\/doi.org\/10.23669\/1ZTELP\" target=\"_blank\" rel=\"noopener\">ParlaCAP dataset<\/a>: an extension of the <a href=\"https:\/\/hdl.handle.net\/11356\/2004\" target=\"_blank\" rel=\"noopener\">ParlaMint 5.0<\/a> collection enriched with sentiment and topic annotations, as well as extended metadata on parties and democracies. The dataset contains around 8 million speeches from 28 European parliaments and is provided in a tabular format, which makes the ParlaMint corpora easier to use for social and political science research. As part of the <a href=\"https:\/\/oscars-project.eu\/projects\/parlacap-comparing-agenda-settings-across-parliaments-parlamint-dataset\" target=\"_blank\" rel=\"noopener\">OSCARS ParlaCAP project<\/a>, the dataset was published through the Croatian CESSDA node <a href=\"https:\/\/www.crossda.hr\/\" target=\"_blank\" rel=\"noopener\">CROSSDA<\/a>, thereby promoting collaboration between infrastructures. We also released a <a href=\"https:\/\/huggingface.co\/classla\/ParlaCAP-Topic-Classifier\" target=\"_blank\" rel=\"noopener\">multilingual topic classifier<\/a> that uses the CAP (Comparative Agendas Project) labels, and <a href=\"https:\/\/github.com\/clarinsi\/ParlaCAP-Analysis-Tutorials\" target=\"_blank\" rel=\"noopener\">tutorials for analysing ParlaCAP data in Python<\/a>. 
More information is available <a href=\"https:\/\/www.clarin.eu\/sites\/default\/files\/18-Bazaar-Ljubesic.pdf\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h3><strong>June 2, 2025 \u2013 An overview of LLM activities in Slovenia (LLMs4SSH Knowledge Centre)<\/strong><\/h3>\n<p>At CLARIN.SI, we have launched a new webpage highlighting current research activities related to large language models (LLMs) in Slovenia, created in the context of our membership in the <a href=\"https:\/\/llms4ssh.clarin-pl.eu\/\" target=\"_blank\" rel=\"noopener\">CLARIN ERIC LLMs4SSH Knowledge Centre<\/a>.<\/p>\n<p>Explore the page here: <a href=\"https:\/\/www.clarin.si\/info\/k-centres\/llms4ssh-clarin-k-centre-for-large-language-models-in-ssh\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.clarin.si\/info\/k-centres\/llms4ssh-clarin-k-centre-for-large-language-models-in-ssh\/<\/a><\/p>\n<p>The page is also available in Slovenian: <a href=\"https:\/\/www.clarin.si\/info\/k-centri\/llms4ssh-sredisce-znanja-za-velike-jezikovne-modele-za-druzboslovje-in-humanistiko\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.clarin.si\/info\/k-centri\/llms4ssh-sredisce-znanja-za-velike-jezikovne-modele-za-druzboslovje-in-humanistiko\/<\/a><\/p>\n<p>The site brings together essential information on:<\/p>\n<ul>\n<li>key projects focusing on LLMs in which Slovenia is participating,<\/li>\n<li>existing benchmarks for LLMs in Slovenian,<\/li>\n<li>benchmarks and datasets for LLM evaluation in South Slavic and other languages provided by the CLASSLA Knowledge Centre,<\/li>\n<li>the openly available large language models, speech technologies, and other natural-language processing (NLP) technologies for Slovenian &#8211; <a href=\"https:\/\/github.com\/clarinsi\/Slovenian-Language-Technologies-Overview\/\" target=\"_blank\" rel=\"noopener\">see the overview here<\/a>.<\/li>\n<\/ul>\n<p>We invite researchers, developers, and anyone interested in LLMs for Slovenian and 
other languages to explore the webpage and make use of these resources.<\/p>\n<p>If there&#8217;s important information or a resource we&#8217;ve missed, we would be happy to include it &#8211; just get in touch via our helpdesk (helpdesk.classla@clarin.si).<\/p>\n<p>&nbsp;<\/p>\n<h3><strong>May 15, 2025 \u2013 CLASSLA-Express 2.0 in Zagreb: Corpora vs. Large Language Models<\/strong><\/h3>\n<p>We invite you to attend the workshop \u201cCLASSLA-Express 2.0: Corpora vs. Large Language Models\u201d, taking place as a pre-conference event at the <a href=\"https:\/\/hdpl.hr\/hdpl-conference-2025\/\" target=\"_blank\" rel=\"noopener\">39th International Conference of the Croatian Association for Applied Linguistics<\/a>.<\/p>\n<p>The workshop will take place on June 11, from 9:00 to 13:00, at the Faculty of Humanities and Social Sciences, University of Zagreb. Participation is free of charge. The workshop will be held in Croatian.<\/p>\n<p>Registration is required by June 3. More information and the registration link are provided <a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2025\/05\/CLASSLA-express-2.0_Zagreb-HDPL.pdf\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>Convened by Slobodan Beliga, Ivana Filipovi\u0107 Petrovi\u0107, and Jelena Parizoska, the workshop focuses on the comparative analysis of linguistic data derived from traditional corpora and outputs generated by large language models. Special attention will be given to South Slavic languages and the handling of phraseme constructions. 
Participants will compare results retrieved from the CLASSLA-web corpora (using the CLARIN.SI NoSketch Engine concordancer) with those generated through platforms that use large language models (such as ChatGPT).<\/p>\n<p>The workshop pursues two main objectives: to evaluate the effectiveness of language technologies in handling low- and medium-resource languages, and to explore their potential in processing complex multi-word expressions and semantic ambiguity.<\/p>\n<p>You can find more information about this CLASSLA-Express workshop, as well as upcoming workshops in the coming months, at the following link: <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/<\/a><\/p>\n<p>We hope to see you there!<\/p>\n<h3><strong>April 2, 2025 \u2013 <\/strong><strong>CLASSLA-Express is back!<\/strong><\/h3>\n<p>We are excited to announce the next iteration of CLASSLA-Express workshops, a hands-on series designed to explore the use of CLASSLA-web corpora in linguistic research!<\/p>\n<p>After a successful 2024 edition, which included workshops in 8 cities across 5 countries (see the reports <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/#April_to_November_2024_CLASSLA-Express_Workshops_on_using_CLARINSI_corpora_in_language_research\" target=\"_blank\" rel=\"noopener\">here<\/a>), this year, the CLASSLA-Express route has expanded to include <strong>an additional country and 3 new cities<\/strong>: Klagenfurt and Graz in Austria, and Bled in Slovenia. In addition, the CLASSLA-Express series will visit Zagreb and Rijeka in Croatia again, now with a new focus. 
The workshops will continue to explore corpus-linguistic research using CLASSLA-web corpora for South Slavic languages, while also testing how large language models (LLMs) perform linguistic tasks, <strong>thereby integrating AI tools with traditional methods<\/strong>.<\/p>\n<p>This year&#8217;s iteration offers two workshop formats:<\/p>\n<ul>\n<li><strong>CLASSLA-Express 1.0<\/strong>, which provides an introduction to corpus-linguistic research methods, exploring word meanings, collocations, and lexico-grammatical patterns in different text types.<\/li>\n<li><strong>CLASSLA-Express 2.0<\/strong>, which focuses on the application of LLMs to linguistic tasks, comparing AI-based approaches with traditional corpus methods and contributing to the development of a framework for combining the two toolsets.<\/li>\n<\/ul>\n<p>Workshops are free of charge and open to university students, linguists, lexicographers, language teachers, digital humanities scholars, and others.<\/p>\n<p>Key dates for CLASSLA-Express workshops in 2025:<\/p>\n<ul>\n<li><strong>4 April 2025<\/strong> \u2013 CLASSLA-Express 1.0 in <strong>Klagenfurt, Austria<\/strong> (University of Klagenfurt).<\/li>\n<li><strong>11 June 2025<\/strong> \u2013 CLASSLA-Express 2.0 in <strong>Zagreb, Croatia<\/strong>, as part of HDPL 2025 (Faculty of Humanities and Social Sciences, University of Zagreb).<\/li>\n<li><strong>10 October 2025<\/strong> \u2013 CLASSLA-Express 1.0 in <strong>Graz, Austria<\/strong> (Institute for Slavic Studies, University of Graz).<\/li>\n<li><strong>5 November 2025<\/strong> \u2013 CLASSLA-Express 2.0 in <strong>Rijeka, Croatia<\/strong>, as part of CLARC 2025 (Faculty of Humanities and Social Sciences, University of Rijeka).<\/li>\n<li><strong>17 November 2025<\/strong> \u2013 CLASSLA-Express 2.0 in <strong>Bled, Slovenia<\/strong>, as part of the eLex 2025 conference (Rikli Balance Hotel).<\/li>\n<\/ul>\n<p>For details and registration, visit <a 
href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\">the CLASSLA-Express website<\/a>.<\/p>\n<p>We hope to see you at one of the workshops!<\/p>\n<p>The CLASSLA-Express team: Jelena Parizoska, Ivana Filipovi\u0107 Petrovi\u0107, Petya Osenova, Taja Kuzman and Nikola Ljube\u0161i\u0107<\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2025\/04\/CLASSLA-Express-2.0.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-8059 aligncenter\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2025\/04\/CLASSLA-Express-2.0.png\" alt=\"\" width=\"415\" height=\"233\" \/><\/a><\/p>\n<h3><strong>December 18, 2024 \u2013 CLASSLA Annual Recap: 2024 in Review<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">As the year comes to a close, we would like to share a brief summary of the main activities and progress made at the CLASSLA Knowledge Centre for South Slavic Languages during 2024.<\/span><\/p>\n<p><b>CLASSLA web corpora for South Slavic languages<\/b><\/p>\n<p><span style=\"font-weight: 400;\">This year, we set up a crawling infrastructure for the (bi)annual collection of web corpora for South Slavic languages \u2013 the <\/span><a href=\"https:\/\/aclanthology.org\/2024.lrec-main.291\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-web corpora collection<\/span><\/a><span style=\"font-weight: 400;\">. 
The first version of the corpora, CLASSLA-web 1.0, comprising 11 billion words in 7 languages, was <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">added to the CLARIN.SI concordancers<\/span><\/a><span style=\"font-weight: 400;\"> in 2023 and <\/span><a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/discover?query=%22CLASSLA-web%22&amp;submit=Search&amp;filtertype_1=title&amp;filter_relational_operator_1=contains&amp;filter_1=%22CLASSLA-web%22&amp;filtertype_2=title&amp;filter_relational_operator_2=contains&amp;filter_2=&amp;query=&amp;rpp=10&amp;sort_by=dc.date.issued_dt&amp;order=desc\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">released on the CLARIN.SI repository this year<\/span><\/a><span style=\"font-weight: 400;\">. The web corpora are linguistically annotated with an <\/span><a href=\"https:\/\/zenodo.org\/records\/13936406\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">improved CLASSLA-Stanza<\/span><\/a><span style=\"font-weight: 400;\"> tool for linguistic annotation of South Slavic languages (<\/span><a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">available as a service here<\/span><\/a><span style=\"font-weight: 400;\">) and a multilingual genre classifier <\/span><a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">X-GENRE<\/span><\/a><span style=\"font-weight: 400;\">. 
Owing to their large size and recency, the CLASSLA-web corpora have already proven very useful for the development of large language models for South Slavic languages, and were included in the training datasets for the <\/span><a href=\"https:\/\/huggingface.co\/cjvt\/GaMS-1B\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">GaMS<\/span><\/a><span style=\"font-weight: 400;\"> (Generative Model for Slovene) model and the <\/span><a href=\"https:\/\/huggingface.co\/gordicaleksa\/YugoGPT\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">YugoGPT<\/span><\/a><span style=\"font-weight: 400;\"> model for Bosnian, Croatian, and Serbian. The next version of the CLASSLA web corpora has already been collected, and the release is planned for 2025.<\/span><\/p>\n<p><b>CLASSLA-Express<\/b> <b>workshop series<\/b><\/p>\n<p><span style=\"font-weight: 400;\">In collaboration with Ivana Filipovi\u0107 Petrovi\u0107, Jelena Parizoska and Petya Osenova, we organized seven <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Express workshops<\/span><\/a><span style=\"font-weight: 400;\"> in five South Slavic countries, attended by over 120 participants. The workshops focused on introducing concordancers, CLASSLA-web corpora, and CLARIN.SI services to linguists, lexicographers, language teachers, digital humanities scholars, and students. Feedback was extremely positive, and we are planning additional workshops for 2025, with sessions to be held in Bulgaria, Croatia, and Slovenia, as well as beyond the South Slavic region, in locations such as Austria. 
The workshops will also feature new topics, including <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/#September_2024_Round_Table_on_the_Usage_of_Large_Language_Models_in_Corpus-Linguistic_Research\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">the application of large language models in corpus linguistics and lexicography<\/span><\/a><span style=\"font-weight: 400;\">. Stay tuned for more details about the upcoming workshops!<\/span><\/p>\n<p><b>Benchmarking LLMs for South Slavic languages and dialects<\/b><\/p>\n<p><span style=\"font-weight: 400;\">The rapid advancements in large language models have also reached South Slavic languages, and evaluation of their capabilities has become crucial to understand the strengths and limitations of these models for our languages, and to guide future development in both academic and applied settings. To this end, we benchmarked large language models for South Slavic languages and <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1766\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">dialects, including the Torlak, the Chakavian, and the Cerkno dialect<\/span><\/a><span style=\"font-weight: 400;\">, on the task of commonsense reasoning. <\/span><a href=\"https:\/\/aclanthology.org\/2024.vardial-1.18\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">The results<\/span><\/a><span style=\"font-weight: 400;\"> showed impressive capabilities of GPT models in handling South Slavic languages, showcasing not only their strong performance but also their ability to adapt to dialects. Remarkably, these models achieved high levels of accuracy in target dialects when provided with only a handful of examples. 
We are excited to continue our benchmarking activities as part of the LLM4DH and <\/span><a href=\"https:\/\/alt-edic.eu\/projects\/llms4eu\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">LLMs4EU<\/span><\/a><span style=\"font-weight: 400;\"> projects, which will extend over the next few years.<\/span><\/p>\n<p><b>Speech technologies<\/b><\/p>\n<p><span style=\"font-weight: 400;\">We continued dipping our toes into the world of speech technology. Our efforts included the development of the <\/span><a href=\"https:\/\/huggingface.co\/classla\/whisper-large-v3-mici-princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">automatic speech recognition (ASR) system tailored to the Chakavian dialect<\/span><\/a><span style=\"font-weight: 400;\"> based on the <\/span><a href=\"https:\/\/huggingface.co\/datasets\/classla\/Mici_Princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Mi\u0107i Princ dataset<\/span><\/a><span style=\"font-weight: 400;\">. We also worked on the <\/span><a href=\"https:\/\/mezzanine.um.si\/en\/mezzanine-english\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Mezzanine<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/arxiv.org\/abs\/2409.15397\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ParlaSpeech<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/zenodo.org\/records\/13936420\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Mak na konac<\/span><\/a><span style=\"font-weight: 400;\"> projects, which focus on developing spoken corpora and benchmarking speech technologies for Slovenian, Croatian and Serbian. 
In addition to developing various speech technologies, such as the <\/span><a href=\"https:\/\/huggingface.co\/classla\/wav2vecbert2-filledPause\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">classifier for filled pauses in speech (eem)<\/span><\/a><span style=\"font-weight: 400;\"> that works splendidly for a series of South Slavic languages, we started building the CLASSLA infrastructure for speech research by <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlaspeech_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">publishing ParlaSpeech corpora also on concordancers<\/span><\/a><span style=\"font-weight: 400;\">. We are currently working on further enriching these corpora with disfluency information, primary stress position, and boundaries of prosodic units.<\/span><\/p>\n<p><b>Sharing knowledge on language resources for South Slavic languages<\/b><\/p>\n<p><span style=\"font-weight: 400;\">As a knowledge centre, one of our core activities is sharing valuable information and supporting users in their work with language resources and technologies. Over the past year, we have responded to numerous helpdesk inquiries regarding access to resources and their use. 
In addition to providing direct support, we also maintain informative materials to help users navigate available resources \u2013 the CLASSLA FAQs for <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4slovene\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovenian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4croatian\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Croatian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4serbian\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Serbian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4bulgarian\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bulgarian<\/span><\/a><span style=\"font-weight: 400;\">, and <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/faq4macedonian\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Macedonian<\/span><\/a><span style=\"font-weight: 400;\">. Furthermore, we released a new <\/span><a href=\"https:\/\/github.com\/clarinsi\/Slovenian-Language-Technologies-Overview\/tree\/main\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">overview of Slovenian language technologies<\/span><\/a><span style=\"font-weight: 400;\">, summarizing the state-of-the-art language technologies for Slovenian.<\/span><\/p>\n<p><b>Monitoring the usage of language resources<\/b><\/p>\n<p><span style=\"font-weight: 400;\">We also actively supported our parent organization, <\/span><span style=\"font-weight: 400;\">CLARIN.SI<\/span><span style=\"font-weight: 400;\">, by monitoring the usage of freely accessible language resources and concordancers provided by the CLARIN.SI infrastructure. 
This allowed us to gain valuable insights into which datasets, technologies, and corpora are used the most. We were pleased to discover significant usage from outside Slovenia, with users frequently querying corpora in over 18 different languages. We invite you to watch a brief <\/span><a href=\"https:\/\/www.clarin.si\/info\/end-of-year-review-clarin-si-in-2024\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">1-minute video<\/span><\/a><span style=\"font-weight: 400;\"> presenting key statistics, including the number of visits, most popular resources, and a closer look at concordancer usage.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are also very happy with the uptake of our <\/span><a href=\"https:\/\/huggingface.co\/classla\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Hugging Face page<\/span><\/a><span style=\"font-weight: 400;\"> from where our ParlaSpeech corpora have been downloaded more than 6,000 times in the last few months. Our models are also heavily used, with the recently published <\/span><a href=\"https:\/\/huggingface.co\/classla\/multilingual-IPTC-news-topic-classifier\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">multilingual IPTC news topic classifier<\/span><\/a><span style=\"font-weight: 400;\"> being downloaded almost 13,000 times in the past four months.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We would like to take this opportunity to thank all our collaborators for another incredibly productive year and to express our gratitude to you for staying engaged with our activities. We look forward to another year of exciting developments and continued collaboration. 
Wishing you all a successful and fulfilling year ahead, both professionally and personally.<\/span><\/p>\n<h3><strong>May 6, 2024 \u2013 CLASSLA-Express Workshops in North Macedonia, Bulgaria and Slovenia<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">This year, we are <\/span><span style=\"font-weight: 400;\">organizing the <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\"><b>CLASSLA-Express workshop series<\/b><\/a><span style=\"font-weight: 400;\">: six workshops in five countries designed to demonstrate the practical applications of the <\/span><a href=\"https:\/\/arxiv.org\/abs\/2403.12721\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA web corpora<\/span><\/a><span style=\"font-weight: 400;\"> in language research. The first two workshops have already taken place in Zagreb and Rijeka &#8211; <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/#Report_from_the_first_two_stops_of_CLASSLA-Express_Zagreb_and_Rijeka\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">see the report on how well they went<\/span><\/a><span style=\"font-weight: 400;\"> &#8211; and the registration for the May workshop in Belgrade is now closed as we have already reached the maximum capacity.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are very happy to announce that since the start of the series, we have added another stop: in June, we will also visit Sofia (Bulgaria) at the <\/span><a href=\"https:\/\/clada-bg.eu\/en\/dissemination\/events\/international-clada-bg-conference-2024.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">International CLaDA-BG Conference 2024<\/span><\/a><span style=\"font-weight: 400;\">, which was arranged in collaboration with the <\/span><a href=\"https:\/\/clada-bg.eu\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 
400;\">CLaDA-BG<\/span><\/a><span style=\"font-weight: 400;\"> research infrastructure, a member of the CLASSLA knowledge centre.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Registration for the next three CLASSLA-Express workshops in Skopje, Sofia and Ljubljana is now open! Here are the details of the workshops:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">4 June 2024 &#8211; CLASSLA-Express stop in Skopje, North Macedonia (<\/span><a href=\"https:\/\/flf.ukim.mk\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bla\u017ee Koneski Faculty of Philology, Ss. Cyril and Methodius University<\/span><\/a><span style=\"font-weight: 400;\">). <\/span><a href=\"https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSehSj9Y0n8qL6VD4bpNihmelKyqE0lnCPE3PZUquMqeiKn8Aw\/viewform\" target=\"_blank\" rel=\"noopener\"><b>Register here<\/b><\/a><b>.<\/b> <a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/05\/Poziv-na-radionicu-CLASSLA-Express_SK.docx.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here.<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">26 June 2024 &#8211; CLASSLA-Express stop in Sofia, Bulgaria (<\/span><a href=\"https:\/\/clada-bg.eu\/en\/dissemination\/events\/international-clada-bg-conference-2024.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">International CLaDA-BG Conference 2024<\/span><\/a><span style=\"font-weight: 400;\">). <\/span><a href=\"https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLScn7uWfMpmAB-f7vBtI72ULn-0Snt_SivbDhimSUAGKGaWO7A\/viewform\" target=\"_blank\" rel=\"noopener\"><b>Register here<\/b><\/a><span style=\"font-weight: 400;\">. 
<\/span><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/05\/Call-for-participation_CLASSLA-Express_Sofia.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here.<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">18 September 2024 &#8211; CLASSLA-Express stop in Ljubljana, Slovenia (<\/span><a href=\"https:\/\/www.sdjt.si\/wp\/jtdh-2024-en\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Language Technologies &amp; Digital Humanities Conference 2024, University of Ljubljana<\/span><\/a><span style=\"font-weight: 400;\">). <\/span><a href=\"https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSfJLKuEc-IqrXUre6Qi2JjOFnumpl2hXcz5B5r7H810Dn_Y-Q\/viewform\" target=\"_blank\" rel=\"noopener\"><b>Register here<\/b><\/a><span style=\"font-weight: 400;\">. <\/span><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/05\/Call-for-participation_CLASSLA-Express_LJ.docx.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here<\/span><span style=\"font-weight: 400;\">.<\/span><\/a><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">The workshops, which are free of charge, will provide hands-on experience in using the <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI NoSketch Engine concordancer<\/span><\/a><span style=\"font-weight: 400;\"> to extract valuable insights on word meanings, usage, collocations, and grammatical patterns from <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_bg\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bulgarian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a 
href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_mk\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Macedonian<\/span><\/a><span style=\"font-weight: 400;\">, and <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovene<\/span><\/a><span style=\"font-weight: 400;\"> corpora. The workshops are tailored for university students of South Slavic languages, linguists, lexicographers, language teachers, and digital humanities scholars.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We warmly welcome you to join us at the nearest workshop location. For further details, please visit <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><strong>April 15, 2024 \u2013 Mi\u0107i Princ meets the Whisper ASR model<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">As you might have noticed, recently, we extended our efforts of providing language resources and technologies from standard South Slavic languages to South Slavic dialects as well (you might have heard about the COPA datasets in Cerkno, Torlak and Chakavian dialects which are the stars of <\/span><a href=\"https:\/\/sites.google.com\/view\/vardial-2024\/shared-tasks\/dialect-copa\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">the DIALECT-COPA unshared task at the VarDial 2024 workshop<\/span><\/a><span style=\"font-weight: 400;\"> in Mexico City). 
Now, we are pleased to announce the first resources for speech technologies for Chakavian micro-dialects of Croatian: <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1765\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">the Mi\u0107i Princ dataset<\/span><\/a><span style=\"font-weight: 400;\"> and an <\/span><a href=\"https:\/\/huggingface.co\/classla\/whisper-large-v3-mici-princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">automatic speech recognition model for Chakavian<\/span><\/a><span style=\"font-weight: 400;\">, both openly available.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1765\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Mi\u0107i Princ dataset<\/span><\/a><span style=\"font-weight: 400;\"> is a &#8220;text and speech&#8221; dialectal translation of Antoine de Saint-Exup\u00e9ry&#8217;s &#8220;Le Petit Prince&#8221; (The Little Prince) into various Chakavian micro-dialects, released by the Udruga Calculus and the Peek&amp;Poke museum<\/span><span style=\"font-weight: 400;\">, in the form of both a <\/span><a href=\"https:\/\/www.peekpoke.hr\/mici-princ-an-edition-of-the-little-prince-in-the-chakavian-dialect-book-presentation\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">printed book<\/span><\/a><span style=\"font-weight: 400;\"> and an <\/span><a href=\"https:\/\/www.peekpoke.hr\/mici-princ-the-little-prince-in-the-chakavian-dialect-audio-book-presentation-and-exhibition\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">audio book<\/span><\/a><span style=\"font-weight: 400;\">. 
Almost every character in the book was translated and narrated into a different micro-dialect (for which we would like to thank again the large team of translators and audio book narrators behind this, especially the main translator, Tea Perin\u010di\u0107).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Following the creation of the Mi\u0107i Princ dataset, our colleagues Peter Rupnik and Nikola Ljube\u0161i\u0107 aligned the text and speech to develop the first openly available dataset for Chakavian automatic speech recognition (ASR). <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1765\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">The dataset is published on the CLARIN.SI repository<\/span><\/a><span style=\"font-weight: 400;\">, as well as on <\/span><a href=\"https:\/\/huggingface.co\/datasets\/classla\/Mici_Princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Hugging Face, where you can listen to it<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Moreover, we are pleased to introduce an innovative outcome derived from this dataset: <\/span><a href=\"https:\/\/huggingface.co\/classla\/whisper-large-v3-mici-princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Whisper-large-v3-mici-princ<\/span><\/a><span style=\"font-weight: 400;\">, an automatic speech recognition model for Chakavian. By fine-tuning OpenAI&#8217;s Whisper model on the Mi\u0107i Princ dataset, <\/span><span style=\"font-weight: 400;\">we achieved a great character-error-rate reduction of 66%. 
You are welcome to <\/span><a href=\"https:\/\/huggingface.co\/classla\/whisper-large-v3-mici-princ\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">try it out on Hugging Face<\/span><\/a><span style=\"font-weight: 400;\">!<\/span><\/p>\n<h3><strong>March 26, 2024 \u2013 CLASSLA-Express Workshops in Croatia, Serbia, North Macedonia and Slovenia<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">We are excited to announce a series of five workshops designed to demonstrate the practical applications of the <\/span><a href=\"https:\/\/arxiv.org\/abs\/2403.12721\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA web corpora<\/span><\/a><span style=\"font-weight: 400;\"> in language research. The CLASSLA-Express workshops will take place from April to September 2024 in 4 countries and 5 cities: Croatia (Zagreb and Rijeka), Serbia (Belgrade), North Macedonia (Skopje) and Slovenia (Ljubljana). The workshops will provide hands-on experience in using the <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI NoSketch Engine concordancer<\/span><\/a><span style=\"font-weight: 400;\"> to extract valuable insights on word meanings, usage, collocations, and grammatical patterns from <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Croatian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_mk\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Macedonian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Serbian<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a 
href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovene<\/span><\/a><span style=\"font-weight: 400;\"> corpora. The workshops are free of charge.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The workshops are tailored for university students of South Slavic languages, linguists, lexicographers, language teachers, and digital humanities scholars.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The registration is already open for the workshops in Zagreb, Rijeka and Belgrade! We will make sure to let you know when the registrations in Skopje and Ljubljana open as well.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here are the details of the workshops:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">19 April 2024 \u2013 CLASSLA-Express stop in Zagreb, Croatia (Faculty of Humanities and Social Sciences, University of Zagreb). 
<\/span><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/Poziv-na-radionicu-CLASSLA-Express_ZG.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here.<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">26 April 2024 \u2013 CLASSLA-Express stop in Rijeka, Croatia (Center for Language Research, Faculty of Humanities and Social Sciences, University of Rijeka).<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/Poziv-na-radionicu-CLASSLA-Express_RI.docx.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here.<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">29 May 2024 \u2013 CLASSLA-Express stop in Belgrade, Serbia (International conference <\/span><i><span style=\"font-weight: 400;\">Leksikografski susreti<\/span><\/i><span style=\"font-weight: 400;\">, Faculty of Philology, University of Belgrade).<\/span><span style=\"font-weight: 400;\">\u00a0<\/span><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2024\/03\/Poziv-na-radionicu-CLASSLA-Express_BG.docx.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">More information about the programme and location is available here.<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">4 June 2024 \u2013 CLASSLA-Express stop in Skopje, North Macedonia (Bla\u017ee Koneski Faculty of Philology, Ss. Cyril and Methodius University). 
Registration is not open yet \u2013 we will let you know when it opens.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">18 September 2024 \u2013 CLASSLA-Express stop in Ljubljana, Slovenia (Language Technologies &amp; Digital Humanities Conference 2024, University of Ljubljana). Registration is not open yet \u2013 we will let you know when it opens.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We warmly welcome you to join us at the nearest workshop location. For registration and further details, please visit <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.clarin.si\/info\/k-centre\/workshops\/classla-express\/<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We would also like to kindly ask you to spread the word among researchers, students and other interested colleagues.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The CLASSLA-Express team: Ivana Filipovi\u0107 Petrovi\u0107, Jelena Parizoska, Taja Kuzman and Nikola Ljube\u0161i\u0107<\/span><\/p>\n<h3><strong>December 24, 2023 \u2013 Updates on new Macedonian resources, and South Slavic endeavours to develop LLMs and ASR models<\/strong><\/h3>\n<p>This year has been extremely packed with activities, hence this very last-minute cheer &#8211; we are well on track to becoming much less of a less-resourced language family! Here are a few examples that come to mind first.<\/p>\n<div>&#8211; <a href=\"https:\/\/universaldependencies.org\/treebanks\/mk_mtb\/index.html\" target=\"_blank\" rel=\"noopener\">Macedonian has arrived in Universal Dependencies\u00a0\ud83e\udd73<\/a>\ud83e\udd73\ud83e\udd73 thanks to Vladimir Cvetkoski. This may be &#8220;only&#8221; 155 sentences and 1,360 tokens, but, hey &#8211; it is infinitely more than there was before. 
Bravo, Vladimir!<\/div>\n<div><\/div>\n<div>&#8211; CLASSLA followed the great example of Vladimir and decided to <a href=\"http:\/\/hdl.handle.net\/11356\/1886\" target=\"_blank\" rel=\"noopener\">publish SETimes.MK<\/a> in its current state as version 0.1,\u00a0570 sentences and\u00a013,310 tokens in size, annotated at the XPOS, UPOS, FEATS and LEMMA levels, to give additional momentum to the positively developing situation for Macedonian.<\/div>\n<div><\/div>\n<div>&#8211; In Slovenia, <a href=\"https:\/\/www.cjvt.si\/povejmo\/\" target=\"_blank\" rel=\"noopener\">the PoVeJMo project<\/a> has started, focused on adapting an LLM to the Slovenian language in general, as well as adapting it to a series of industrial use cases.<\/div>\n<div><\/div>\n<div>&#8211; Andrija Sagi\u0107, a multimedia enthusiast, is seriously biting into the speech apple, <a href=\"https:\/\/huggingface.co\/Sagicc\/whisper-large-v3-sr-cmb\" target=\"_blank\" rel=\"noopener\">additionally fine-tuning the really great whisper-large-v3 model<\/a> on all the data he can scrape together for Serbian, which mostly includes <a href=\"http:\/\/hdl.handle.net\/11356\/1679\" target=\"_blank\" rel=\"noopener\">our Ju\u017ene Vesti dataset<\/a>. We are now working with Andrija on improving the dataset (quite a few typos in the human transcript!) and are looking forward to jointly publishing a version 2.0. 
This is the type of collaboration we are very much in need of!<\/div>\n<div><\/div>\n<div>&#8211; The ReLDI team has started, together with ICEF, Belgrade, the industry-funded (you do not see many of those!)\u00a0<a href=\"https:\/\/icef-nlp.github.io\/COMtext.SR\/\" target=\"_blank\" rel=\"noopener\">ComText.SR project<\/a>\u00a0on collecting, curating, annotating and publicly releasing textual data for various domains of special interest to the industry.<\/div>\n<div><\/div>\n<div>&#8211; The JeRTeh society has started publishing transformer models for Serbian, the first two models being named <a href=\"https:\/\/huggingface.co\/jerteh\/gpt2-vrabac\" target=\"_blank\" rel=\"noopener\">Vrabac<\/a> and <a href=\"https:\/\/huggingface.co\/jerteh\/gpt2-orao\" target=\"_blank\" rel=\"noopener\">Orao<\/a>. You guess which is the bigger one. \ud83d\ude42 We were told there will be additional models coming from that direction and we are very much looking forward to those!<\/div>\n<div><\/div>\n<div>&#8211; You might have followed on social media the most productive project we have ever seen &#8211; \u00a0the <a href=\"https:\/\/www.linkedin.com\/posts\/aleksagordic_well-its-official-yugogpt-7b-significantly-activity-7143209223722627072-0s9Y\/\" target=\"_blank\" rel=\"noopener\">yugoGPT model<\/a> &#8211; work of Aleksa Gordi\u0107. We were happy to be able to support Aleksa at least on the data and some discussion front. It was not easy to keep up with that guy! Wow! We really hope this is not Aleksa&#8217;s (first? 
and) last HBS LLM rodeo!<\/div>\n<h3><strong>December 6, 2023 \u2013 <\/strong><strong>Comparable web corpora CLASSLA-web for all South Slavic languages<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">We are delighted to announce the release of comparable web corpora for all official South Slavic languages, namely <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovenian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Croatian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_bs\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bosnian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_cnr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Montenegrin<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Serbian<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_mk\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Macedonian<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#concordance?corpname=classlaweb_bg\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bulgarian<\/span><\/a><span style=\"font-weight: 400;\">, all the corpora summing up to almost 11 billion words! 
The corpora are freely available on the CLARIN.SI <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">NoSketch Engine<\/span><\/a><span style=\"font-weight: 400;\"> concordancer (see our recent <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">tutorial on how to easily query the CLASSLA web corpora and perform statistical analyses via the concordancer<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This collection of corpora is innovative for the following reasons:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">This is, to the best of our knowledge, the first collection of comparable web corpora covering a whole language group.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The collection includes the first general, linguistically annotated corpora for two of the seven languages, namely Montenegrin and Macedonian.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">The comparability of the corpora is ensured by performing data collection and filtering in the same time period with the same technologies. 
Furthermore, the corpora underwent uniform linguistic processing via the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Stanza<\/span><\/a><span style=\"font-weight: 400;\"> toolkit, which you can now also try out through the <\/span><a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA annotator web interface<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Every document in every corpus is annotated with the <\/span><a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">X-GENRE multilingual genre classifier<\/span><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">For more details, we warmly invite you to read our <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/comparable-classla-web-corpora-of-south-slavic-languages\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">new blog post which introduces the CLASSLA-web corpora<\/span><\/a><span style=\"font-weight: 400;\">. The blog post provides more details on corpus sizes and interesting insights into the correlations between genre distributions and GDP per capita across the seven South Slavic countries.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We will be very glad to receive feedback on our corpora and annotation technology. 
As usual, please write to us at <\/span><a href=\"mailto:helpdesk.classla@clarin.si\"><span style=\"font-weight: 400;\">helpdesk.classla@clarin.si<\/span><\/a><span style=\"font-weight: 400;\">!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These corpora would not have been released without the close collaboration within the CLASSLA Knowledge Centre for South Slavic languages, which includes the Slovenian consortium <\/span><a href=\"https:\/\/www.clarin.si\/info\/about\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI<\/span><\/a><span style=\"font-weight: 400;\">, the <\/span><a href=\"http:\/\/ihjj.hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Institute of Croatian Language<\/span><\/a><span style=\"font-weight: 400;\">, and the Bulgarian consortium <\/span><a href=\"https:\/\/clada-bg.eu\/en\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLADA-BG<\/span><\/a><span style=\"font-weight: 400;\">. Equally crucial were the longstanding collaborations with the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI centre<\/span><\/a><span style=\"font-weight: 400;\"> on a series of South Slavic languages, and with Biljana Stojanovska and Katerina Zdravkova on Macedonian. 
On this occasion, we want to thank everyone for the collaboration, and invite others to join our common efforts!<\/span><\/p>\n<h3><strong>September 13, 2023 \u2013 New state-of-the-art version of CLASSLA-Stanza pipeline for linguistic processing of South Slavic languages<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">We are delighted to announce the release of an improved <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Stanza pipeline<\/span><\/a><span style=\"font-weight: 400;\">, which enables state-of-the-art linguistic processing of Slovenian, Croatian, Serbian, Macedonian and Bulgarian language.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition to covering standard varieties of five South Slavic languages, the pipeline also provides special modules for linguistic annotation of non-standard text and web corpora for Slovenian, Croatian and Serbian. The CLASSLA-Stanza annotation tool supports a total of six tasks: tokenization, morphosyntactic annotation, lemmatization, dependency parsing, semantic role labeling, and named-entity recognition. 
Some of the main improvements that separate CLASSLA-Stanza from the Stanza pipeline are:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">support of external inflectional lexicons which significantly increases performance on morphologically rich languages;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">extended training datasets (beyond Universal Dependencies data) for all included models;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">use of <\/span><a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/discover?query=CLARIN.SI-embed&amp;submit=Search&amp;filtertype_1=type&amp;filter_relational_operator_1=equals&amp;filter_1=lexicalConceptualResource&amp;filtertype_2=title&amp;filter_relational_operator_2=contains&amp;filter_2=&amp;query=embeddings&amp;rpp=10\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI-embed<\/span><\/a><span style=\"font-weight: 400;\"> word embeddings, trained on significantly larger and more diverse datasets than embeddings used by Stanza;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">specific modules for standard, non-standard and web text.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">As a result, we are happy to report that the CLASSLA-Stanza significantly outperforms Stanza, with error reduction between 34% and 98% on the Slovenian official benchmark (see table below which reports the performance using the Micro F1 score). 
You can find more details on the pipeline improvements and training settings in a technical report \u201c<\/span><a href=\"https:\/\/arxiv.org\/abs\/2308.04255\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages<\/span><\/a><span style=\"font-weight: 400;\">\u201d (Ter\u010don &amp; Ljube\u0161i\u0107, 2023).<\/span><\/p>\n<p><a href=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/10\/Classla-stanza-paper.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-6532\" src=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/10\/Classla-stanza-paper-300x60.png\" alt=\"\" width=\"665\" height=\"133\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/10\/Classla-stanza-paper-300x60.png 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2023\/10\/Classla-stanza-paper.png 710w\" sizes=\"auto, (max-width: 665px) 100vw, 665px\" \/><\/a><\/p>\n<p><span style=\"font-weight: 400;\">You can use CLASSLA-Stanza as a <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">python library<\/span><\/a><span style=\"font-weight: 400;\"> (documentation is available <\/span><a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">) or via an <\/span><a href=\"https:\/\/orodja.cjvt.si\/oznacevalnik\/eng\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">online service<\/span><\/a><span style=\"font-weight: 400;\"> (currently available for Slovenian, other languages and modules coming soon). 
Separate models are also freely available at the <\/span><a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/discover?rpp=10&amp;etal=0&amp;query=classla-stanza&amp;group_by=none&amp;page=1&amp;filtertype_0=type&amp;filter_relational_operator_0=equals&amp;filter_0=toolService\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLARIN.SI repository<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These results would not be possible without immense efforts in developing high-quality training datasets together with our collaborators all around Europe. We wish to use this opportunity to most warmly thank all of them!<\/span><\/p>\n<h3><strong>June 23, 2023 \u2013 CLASSLA web corpora of Croatian, Serbian and Slovenian<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">We are<\/span><span style=\"font-weight: 400;\"> delighted to announce the release of the pilot versions (v0.1) of the CLASSLA web corpora for <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Croatian<\/span><\/a><span style=\"font-weight: 400;\"> (2.3 billion words), <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Serbian<\/span><\/a><span style=\"font-weight: 400;\"> (2.4 billion words) and <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Slovenian<\/span><\/a><span style=\"font-weight: 400;\"> (1.9 billion words). 
The main features of the newly released corpora, aside from their massive size and recency (crawled in 2022), are their <\/span><a href=\"https:\/\/huggingface.co\/classla\/xlm-roberta-base-multilingual-text-genre-classifier\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">automatic enrichment with genre information<\/span><\/a><span style=\"font-weight: 400;\"> and their linguistic processing with the improved <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA-Stanza annotation pipeline<\/span><\/a><span style=\"font-weight: 400;\"> (the version applied here will be released soon). The corpora are available for search via the CLARIN.SI concordancers, <\/span><a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Crystal NoSketchEngine<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Bonito NoSketchEngine<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.clarin.si\/kontext\/corpora\/corplist\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">KonText<\/span><\/a><span style=\"font-weight: 400;\">. The pilot versions of these corpora are intended to gather valuable user feedback, while the official release (v1.0) of the three existing corpora, along with web corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is scheduled for later this year.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We warmly invite you to explore our corpora. 
Please reach out to us at <\/span><a href=\"mailto:helpdesk.classla@clarin.si\"><span style=\"font-weight: 400;\">helpdesk.classla@clarin.si<\/span><\/a><span style=\"font-weight: 400;\"> with any ideas for improvements <\/span><span style=\"font-weight: 400;\">\u2013<\/span><span style=\"font-weight: 400;\"> we will do our best to implement them already in the upcoming official release! We also encourage you to share with us how you plan to use these corpora in your research, as well as any other use cases you may have in mind.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For ideas on how the corpora can be used in your research, we invite you to read <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/classla-web-bigger-and-better-web-corpora-for-croatian-serbian-and-slovenian-on-clarin-si-concordancers\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">our blog post on the use of CLASSLA web corpora via the open CLARIN.SI concordancers<\/span><\/a><span style=\"font-weight: 400;\">. The step-by-step tutorial covers a wide range of functionalities of the concordancers, including finding collocations in different genres, analyzing word statistics, and exploring the use of non-standard words. This resource is particularly suited for linguists, language teachers and digital humanists.<\/span><\/p>\n<h3><strong>April 25, 2023 \u2013 A web corpus intermezzo<\/strong><\/h3>\n<p>We have been rather quiet for some time due to many developments that required our full capacity. But you can expect reports on interesting resources, tools, and experiments in the coming months!<\/p>\n<p>We were, however, not the only ones who have been very busy recently. Philipp Wasserscheidt has recently published the PDRS web corpus of the Serbian language, 715 million tokens in size. 
You can find more details on the corpus in the <a href=\"http:\/\/hdl.handle.net\/11356\/1752\" target=\"_blank\" rel=\"noopener\">CLARIN.SI repository entry<\/a> where the corpus is available for download. The corpus is also available <a href=\"https:\/\/www.clarin.si\/noske\/run.cgi\/corp_info?corpname=pdrs10&amp;struct_attr_stats=1\" target=\"_blank\" rel=\"noopener\">via the CLARIN.SI concordancers<\/a>.<\/p>\n<p>Philipp is also making sure that future users know how to use the corpus. This is slightly last-minute, but maybe still not too late for some of you &#8211; a workshop on the PDRS web corpus usage will be held from this Thursday to Saturday in Belgrade. More information is available at <a href=\"https:\/\/javnidiskurs.rs\/poziv-na-radionicu-pdrs-1-0\/\" target=\"_blank\" rel=\"noopener\">https:\/\/javnidiskurs.rs\/poziv-na-radionicu-pdrs-1-0\/<\/a>.<\/p>\n<p>Since we are on the topic of web corpora, we have two pieces of news to share right away as well:<\/p>\n<p>1. The head of the CLASSLA centre, Nikola Ljube\u0161i\u0107, has taken one of the leading roles in the ACL Special Interest Group for Web as a Corpus (SIGWAC). If you are interested in this area of research, you should join the SIG by <a href=\"http:\/\/devel.sslmit.unibo.it\/mailman\/listinfo\/sigwac\" target=\"_blank\" rel=\"noopener\">signing up to the mailing list<\/a>.<\/p>\n<p>2. We are in the process of releasing the MaCoCu datasets, which are web crawls of various national top-level domains, including those of Slovenia, Croatia, Bosnia and Herzegovina, Montenegro, Serbia, Macedonia and Bulgaria. We are sharing here the link just to the <a href=\"http:\/\/hdl.handle.net\/11356\/1801\" target=\"_blank\" rel=\"noopener\">Macedonian dataset<\/a>. 
Linguistic processing of the datasets has just started, and will result in the CLASSLA web corpora, to be updated on a biyearly basis.<\/p>\n<h3><strong>December 22, 2022 \u2013 Looking forward to 2023!<\/strong><\/h3>\n<p>We wish all of you happy holidays and a successful 2023. To wrap up a very busy, but also very successful 2022, we are sharing with you what we will be releasing in the first half of next year.<\/p>\n<p>We are working on releasing a new version of our <a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza tool<\/a>, with the following improvements:<\/p>\n<ul>\n<li>Minor improvements to usability and the programming interface<\/li>\n<li>New Slovenian models for standard and Internet non-standard language, but also for spoken language (transcripts), most of the improvements being the results of the very successful <a href=\"https:\/\/rsdo.slovenscina.eu\/\" target=\"_blank\" rel=\"noopener\">RSDO project<\/a><\/li>\n<li>New standard and non-standard models for Croatian and Serbian, as we are constantly working on <a href=\"https:\/\/github.com\/reldi-data\" target=\"_blank\" rel=\"noopener\">improving our data<\/a> (it is a never-ending game)<\/li>\n<li>Drastically improved standard model for Macedonian (we resolved numerous errors by extending the training data; the previous model was trained on a single Orwell novel only)<\/li>\n<li>We will also release the tool through a web interface and a web service, similar to the <a href=\"https:\/\/orodja.cjvt.si\/oznacevalnik\/eng\/\" target=\"_blank\" rel=\"noopener\">RSDO interface for linguistic processing of Slovenian<\/a> (which also uses CLASSLA-Stanza)<\/li>\n<\/ul>\n<p>Within the <a href=\"https:\/\/www.clarin.eu\/parlamint\" target=\"_blank\" rel=\"noopener\">ParlaMint project<\/a> we are working hard on releasing parliamentary corpora for the Slovenian, Croatian, Bosnian, Serbian and Bulgarian parliaments, which is one of the big 
coordination successes of the CLASSLA K-centre. For comparison, the <a href=\"https:\/\/link.springer.com\/article\/10.1007\/s10579-021-09574-0\" target=\"_blank\" rel=\"noopener\">first iteration of the ParlaMint corpus<\/a> covered only one term of the Croatian parliament, while the corpus will now cover six terms. Bosnian and Serbian were not part of the first iteration of the ParlaMint corpus.<\/p>\n<p>Finally, we will also publish our new generation of web corpora, called CLASSLA-web. We have already prepared the raw data for <a href=\"http:\/\/hdl.handle.net\/11356\/1517\" target=\"_blank\" rel=\"noopener\">Slovenian<\/a> (1.8 billion words), <a href=\"http:\/\/hdl.handle.net\/11356\/1516\" target=\"_blank\" rel=\"noopener\">Croatian<\/a> (2.3 billion words), <a href=\"http:\/\/hdl.handle.net\/11356\/1512\" target=\"_blank\" rel=\"noopener\">Macedonian<\/a> (524 million words), and <a href=\"http:\/\/hdl.handle.net\/11356\/1515\" target=\"_blank\" rel=\"noopener\">Bulgarian<\/a> (3.5 billion words), but will release the corpora both for download and through concordancers once we have all the languages fully processed (we are currently processing Bosnian, Montenegrin and Serbian) and the data annotated with the latest version of CLASSLA-Stanza.<\/p>\n<h3><strong>October 19, 2022 \u2013 Our recent activities on speech<\/strong><\/h3>\n<p>We would like to share with you our recent results on speech processing, which we announced would be one of our foci in 2022.<\/p>\n<p>We released two speech datasets. One is in Croatian, the <a href=\"http:\/\/hdl.handle.net\/11356\/1494\" target=\"_blank\" rel=\"noopener\">ParlaSpeech-HR dataset<\/a>, 1816 hours of recordings in size, with accompanying transcriptions and speaker metadata. The dataset is based on the <a href=\"http:\/\/hdl.handle.net\/11356\/1432\" target=\"_blank\" rel=\"noopener\">ParlaMint corpus<\/a> of Croatian parliamentary proceedings. 
The other dataset is in Serbian, the <a href=\"http:\/\/hdl.handle.net\/11356\/1679\" target=\"_blank\" rel=\"noopener\">JuzneVesti-SR dataset<\/a>, \u201conly\u201d 50 hours in size. It consists of audio recordings and transcripts from the Ju\u017ene Vesti website and its host show called <a href=\"https:\/\/www.juznevesti.com\/Tagovi\/Intervju-15-minuta.sr.html\" target=\"_blank\" rel=\"noopener\">15 minuta<\/a>, with speaker metadata available as well. With each of the datasets, we also released automatic speech recognition (ASR) models on Hugging Face: <a href=\"https:\/\/huggingface.co\/models?search=parlaspeech\" target=\"_blank\" rel=\"noopener\">four Croatian ASR models<\/a> for the ParlaSpeech-HR dataset, with an excellent (but in-domain) word error rate of only 4%, and for now <a href=\"https:\/\/huggingface.co\/classla\/wav2vec2-xls-r-juznevesti-sr\" target=\"_blank\" rel=\"noopener\">one Serbian ASR model<\/a> for the JuzneVesti-SR dataset. You are more than welcome to use any of the models or data (all are available under CC-BY-SA). 
Interestingly, our speech-related efforts were very quickly picked up by the industry as well, featuring our speech and text technologies <a href=\"https:\/\/www.neos.hr\/neos-blog-can-ai-understand-croatian-parliment-asr-model\/\" target=\"_blank\" rel=\"noopener\">in a recent blog<\/a>.<\/p>\n<p>We also published two papers, one on the <a href=\"http:\/\/www.lrec-conf.org\/proceedings\/lrec2022\/workshops\/ParlaCLARINIII\/pdf\/2022.parlaclariniii-1.16.pdf\" target=\"_blank\" rel=\"noopener\">overall approach to building the ParlaSpeech-HR dataset<\/a>, another on performing <a href=\"https:\/\/nl.ijs.si\/jtdh22\/pdf\/JTDH2022_Ljubesic-et-al_The-ParlaSpeech-HR-benchmark-for-speaker-profiling-in-Croatian.pdf\" target=\"_blank\" rel=\"noopener\">benchmarking for user profiling over the ParlaSpeech-HR dataset<\/a>.<\/p>\n<p>Given the recent successes in acquiring funding for performing more research on spoken data, in the following years we will be researching many super-interesting speech-related tasks, including:<\/p>\n<ul>\n<li>word-level clustering of types of pronunciation and extraction of prototypical pronunciations<\/li>\n<li>linguistic processing of transcripts of spoken data, potentially informed by the speech signal itself<\/li>\n<li>disfluency identification and classification<\/li>\n<li>dialogue act classification<\/li>\n<li>identifying ways to build large and cheap spoken corpora of South Slavic languages<\/li>\n<\/ul>\n<p>Please do get in touch if you are interested, or already working on speech. Also, we invite similar e-mails \u2013 drafting future activities \u2013 from other sides as well! 
We need coordination between different efforts, something we discussed at great length in our <a href=\"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/9783110767377-017\/html\" target=\"_blank\" rel=\"noopener\">recently published book chapter<\/a>.<\/p>\n<h3><strong>May 6, 2022 \u2013 Massive monolingual and parallel South Slavic corpora freely available<\/strong><\/h3>\n<p>We are happy to announce that new high-quality monolingual and parallel web corpora for South Slavic languages have been released. The corpora were created within the scope of the\u00a0<a href=\"https:\/\/macocu.eu\/\" target=\"_blank\" rel=\"noopener noreferrer\">MaCoCu<\/a>\u00a0project, which focuses on collecting monolingual and parallel data from the Internet for European under-resourced languages, South Slavic languages included.<\/p>\n<p>The datasets were built by crawling the national top-level domains, extending the crawl dynamically to other domains as well. More information on the construction of the corpora and links to the freely available tools that were used for crawling and cleaning can be found in the description of resources, published on the CLARIN.SI repository (see links below).<\/p>\n<p>The following new South Slavic corpora are freely available from the CLARIN.SI repository:<\/p>\n<ul>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1516\" target=\"_blank\" rel=\"noopener noreferrer\">Croatian web corpus MaCoCu-hr 1.0<\/a>\u00a0with 2.3 billion words in 7 million texts;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1517\" target=\"_blank\" rel=\"noopener noreferrer\">Slovene web corpus MaCoCu-sl 1.0<\/a>\u00a0with 1.8 billion words in 5.8 million texts;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1512\" target=\"_blank\" rel=\"noopener noreferrer\">Macedonian web corpus MaCoCu-mk 1.0<\/a>\u00a0with 0.5 billion words in 1.96 million texts;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1515\" target=\"_blank\" rel=\"noopener noreferrer\">Bulgarian web corpus 
MaCoCu-bg 1.0<\/a>\u00a0with 3.5 billion words in 10.5 million texts;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1522\" target=\"_blank\" rel=\"noopener noreferrer\">Croatian-English parallel corpus MaCoCu-hr-en 1.0<\/a>\u00a0with 135 million words in 3 million segments (sentence pairs);<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1523\" target=\"_blank\" rel=\"noopener noreferrer\">Slovene-English parallel corpus MaCoCu-sl-en 1.0<\/a>\u00a0with 137 million words in 3 million segments;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1513\" target=\"_blank\" rel=\"noopener noreferrer\">Macedonian-English parallel corpus MaCoCu-mk-en 1.0<\/a>\u00a0with 24 million words in 0.48 million segments;<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1521\" target=\"_blank\" rel=\"noopener noreferrer\">Bulgarian-English parallel corpus MaCoCu-bg-en 1.0<\/a>\u00a0with 159 million words in 3.9 million segments.<\/li>\n<\/ul>\n<p>We are already working on using the above datasets for BERT-like language model pre-training, and producing linguistically-annotated corpora that will be available through our concordancers. Next year, the corpora will be upgraded and additional South Slavic monolingual and parallel corpora will be released, i.e., Bosnian, Serbian and Montenegrin.<\/p>\n<h3><strong>April 20, 2022 \u2013 First open speech-to-text system and ASR training dataset for Croatian<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">The first open speech-to-text system for Croatian is <\/span><a href=\"https:\/\/huggingface.co\/classla\/wav2vec2-xls-r-parlaspeech-hr\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">now available<\/span><\/a><span style=\"font-weight: 400;\"> in the Hugging Face model hub. The system is currently based on 72 hours of transcripts of parliamentary debates from the Croatian parliament. 
The <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1494\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ASR training dataset for Croatian ParlaSpeech-HR v1.0<\/span><\/a> <span style=\"font-weight: 400;\">is freely available in the CLARIN.SI repository. The dataset and the system were developed by Nikola Ljube\u0161i\u0107, Ivo-Pavao Jazbec, Vuk Batanovi\u0107, Lenka Baj\u010deti\u0107, Danijel Korzinek and Peter Rupnik. These results would not have been possible without the wider collaboration around the ParlaMint project, for which we thank Darja Fi\u0161er, Toma\u017e Erjavec, Maciej Ogrodniczuk and Petya Osenova.<\/span><\/p>\n<h3><strong>December 21, 2021 \u2013 CLASSLA in Tour de CLARIN<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">CLASSLA has been featured in Tour de CLARIN, a CLARIN ERIC initiative that presents its national consortia, Knowledge Centres and Service Providing Centres (B-centres). Find out more about CLASSLA&#8217;s activities, services and mission <\/span><a href=\"https:\/\/clarin.eu\/blog\/tour-de-clarin-clarin-knowledge-centre-south-slavic-languages-classla\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">. The <\/span><a href=\"https:\/\/office.clarin.eu\/v\/CE-2021-1975-TDC-VOL4.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">new volume of Tour de CLARIN<\/span><\/a><span style=\"font-weight: 400;\"> also includes an interview with Zrinka Kolakovi\u0107 in which she shares how she uses our corpora and tools to research South Slavic clitics and aspect. 
Read more <\/span><a href=\"https:\/\/www.clarin.eu\/blog\/tour-de-clarin-interview-zrinka-kolakovic\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><strong>December 13, 2021 \u2013 Workshop on regional markedness in text<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">On 6 and 7 November 2021, an <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/workshops\/#Workshop_on_regional_markedness_in_text\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">online workshop on regional markedness in text<\/span><\/a><span style=\"font-weight: 400;\"> took place, organised by the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI<\/span><\/a><span style=\"font-weight: 400;\"> centre, <\/span><a href=\"https:\/\/www.spur.uzh.ch\/en.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">University of Zurich<\/span><\/a><span style=\"font-weight: 400;\">, and <\/span><span style=\"font-weight: 400;\">CLASSLA<\/span><span style=\"font-weight: 400;\">. The materials from the workshop are now available <\/span><a href=\"https:\/\/github.com\/clarinsi\/workshop_reg_mark\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">. 
They provide a gentle introduction to querying the corpora through the <\/span><a href=\"https:\/\/www.clarin.si\/noske\/index-en.html\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">noSketchEngine<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/www.clarin.si\/kontext\/corpora\/corplist\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">KonText<\/span><\/a><span style=\"font-weight: 400;\"> concordancers, using the Corpus Query Language (CQL) syntax and morphosyntactic descriptions (MSDs) to analyse gender bias in society.<\/span><\/p>\n<h3><strong>November 26, 2021 &#8211; Success Stories<\/strong><\/h3>\n<p><span style=\"font-weight: 400;\">A new entry has been added to the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). In <\/span><a href=\"https:\/\/www.clarin.si\/info\/k-centre\/success-stories\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Success stories<\/span><\/a><span style=\"font-weight: 400;\">, we present activities where collaboration resulted in important language resources for Slovenian, Croatian and Serbian, created with a fraction of the full costs by exploiting the large synergistic potential of South Slavic languages. 
These are the stories that motivated the creation of CLASSLA.<\/span><\/p>\n<p><a href=\"http:\/\/hdl.handle.net\/11372\/DOC-153\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3805\" src=\"http:\/\/www.clarin.si\/info\/wp-content\/uploads\/2019\/03\/K-centre-logo-300x90.jpeg\" alt=\"\" width=\"537\" height=\"161\" srcset=\"https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2019\/03\/K-centre-logo-300x90.jpeg 300w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2019\/03\/K-centre-logo-768x230.jpeg 768w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2019\/03\/K-centre-logo-1024x307.jpeg 1024w, https:\/\/www.clarin.si\/info\/wp-content\/uploads\/2019\/03\/K-centre-logo.jpeg 1181w\" sizes=\"auto, (max-width: 537px) 100vw, 537px\" \/><\/a><\/p>\n<div id=\"themify_builder_content-3558\" data-postid=\"3558\" class=\"themify_builder_content themify_builder_content-3558 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. 
Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-3558","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=3558"}],"version-history":[{"count":157,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558\/revisions"}],"predecessor-version":[{"id":8800,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558\/revisions\/8800"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=3558"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}