Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

CLASSLA: Knowledge centre for South Slavic languages

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.

The helpdesk of CLASSLA can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.

The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use the CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.

The most relevant announcements, discussed in our mailing list, are made available below. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.

CLASSLA is operated by CLARIN.SI, the Institute of Croatian Language, and CLADA-BG.

CLASSLA Blog Posts

Recent Announcements

December 24, 2023 – Updates on new Macedonian resources, and South Slavic endeavours to develop LLMs and ASR models

This year has been extremely packed with activities, hence this very last-minute cheer – we are on a good track to become much less of a less-resourced language family! We give a few examples that come to mind first.

– Macedonian has arrived in Universal Dependencies 🥳🥳🥳 thanks to Vladimir Cvetkoski. This may be “only” 155 sentences and 1,360 tokens, but, hey – it is infinitely more than there was before. Bravo, Vladimir!
– CLASSLA followed Vladimir’s great example and decided to publish SETimes.MK in its current state as version 0.1, 570 sentences and 13,310 tokens in size, annotated at the XPOS, UPOS, FEATS and LEMMA levels, to give additional momentum to the positive developments for Macedonian.
– In Slovenia, the PoVeJMo project has started, focused on adapting an LLM to the Slovenian language in general, as well as to a series of industrial use cases.
– Andrija Sagić, a multimedia enthusiast, is seriously biting into the speech apple, additionally fine-tuning the really great whisper-large-v3 model on all the data he can scrape together for Serbian, which mostly includes our Južne Vesti dataset. We are now working with Andrija on improving the dataset (quite a few typos in the human transcripts!) and are looking forward to jointly publishing a version 2.0. This is the type of collaboration we are very much in need of!
– The ReLDI team has started, together with ICEF, Belgrade, the industry-funded (you do not see many of those!) ComText.SR project on collecting, curating, annotating and publicly releasing textual data for various domains of special interest to the industry.
– The JeRTeh society has started publishing transformer models for Serbian, the first two models being named Vrabac and Orao. You guess which is the bigger one. 🙂 We were told there will be additional models coming from that direction and we are very much looking forward to those!
– You might have followed on social media the most productive project we have ever seen – the yugoGPT model, the work of Aleksa Gordić. We were happy to be able to support Aleksa at least on the data and some discussion front. It was not easy to keep up with that guy! Wow! We really hope this is not Aleksa’s (first? and) last HBS LLM rodeo!

December 6, 2023 – Comparable web corpora CLASSLA-web for all South Slavic languages

We are delighted to announce the release of comparable web corpora for all official South Slavic languages, namely Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian and Bulgarian, all the corpora summing up to almost 11 billion words! The corpora are freely available on the CLARIN.SI NoSketch Engine concordancer (see our recent tutorial on how to easily query the CLASSLA web corpora and perform statistical analyses via the concordancer).

This collection of corpora is innovative for the following reasons:

  • This is, to the best of our knowledge, the first collection of comparable web corpora covering a whole language group.
  • The collection includes the first general, linguistically annotated corpora for two out of seven languages, namely Montenegrin and Macedonian.
  • The comparability of the corpora is ensured by performing data collection and filtering in the same time period with the same technologies. Furthermore, the corpora underwent uniform linguistic processing via the CLASSLA-Stanza toolkit, which you can now also try out through the CLASSLA annotator web interface.
  • Each of the documents in each of the corpora is annotated with the X-GENRE multilingual genre classifier.

For more details, we warmly invite you to read our new blog post which introduces the CLASSLA-web corpora. The blog post provides more details on the corpora sizes and interesting insights on the correlations between genre distributions and GDP per capita across the seven South Slavic countries.

We will be very glad to receive feedback on our corpora and annotation technology. As usual, please write to us at helpdesk.classla@clarin.si!

These corpora would not have been released without great collaboration inside the CLASSLA Knowledge Centre for South Slavic languages, which includes the Slovenian consortium CLARIN.SI, the Institute of Croatian Language, and the Bulgarian consortium CLADA-BG. Also crucial were the longstanding collaborations with the ReLDI centre on a series of South Slavic languages, and with Biljana Stojanovska and Katerina Zdravkova on Macedonian. On this occasion, we want to thank everyone for the collaboration, and invite others to join our common efforts!

September 13, 2023 – New state-of-the-art version of CLASSLA-Stanza pipeline for linguistic processing of South Slavic languages

We are delighted to announce the release of an improved CLASSLA-Stanza pipeline, which enables state-of-the-art linguistic processing of Slovenian, Croatian, Serbian, Macedonian and Bulgarian.

In addition to covering standard varieties of five South Slavic languages, the pipeline also provides special modules for linguistic annotation of non-standard text and web corpora for Slovenian, Croatian and Serbian. The CLASSLA-Stanza annotation tool supports a total of six tasks: tokenization, morphosyntactic annotation, lemmatization, dependency parsing, semantic role labeling, and named-entity recognition. Some of the main improvements that separate CLASSLA-Stanza from the Stanza pipeline are:

  • support of external inflectional lexicons which significantly increases performance on morphologically rich languages;
  • extended training datasets (beyond Universal Dependencies data) for all included models;
  • use of CLARIN.SI-embed word embeddings, trained on significantly larger and more diverse datasets than embeddings used by Stanza;
  • specific modules for standard, non-standard and web text.

As a result, we are happy to report that CLASSLA-Stanza significantly outperforms Stanza, with error reduction between 34% and 98% on the Slovenian official benchmark (see table below which reports the performance using the Micro F1 score). You can find more details on the pipeline improvements and training settings in the technical report “CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages” (Terčon & Ljubešić, 2023).
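For readers unfamiliar with the term, error reduction is usually computed relative to the errors of the baseline system. The snippet below is a minimal sketch of that arithmetic in Python. The function name and the example scores are illustrative assumptions and are not taken from the technical report.

```python
def relative_error_reduction(baseline_score: float, improved_score: float) -> float:
    """Share of the baseline's errors removed by the improved system.

    Both scores are percentages (0-100), for example Micro F1.
    The error of a system is taken to be 100 minus its score.
    """
    baseline_error = 100.0 - baseline_score
    improved_error = 100.0 - improved_score
    if baseline_error <= 0:
        raise ValueError("baseline has no errors left to reduce")
    return (baseline_error - improved_error) / baseline_error


# Hypothetical scores, chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(baseline_score=90.0, improved_score=95.0)
print(f"{reduction:.0%}")  # prints 50%
```

Read this way, a jump from 90 to 95 removes half of the remaining errors, which is why error reduction can look large even when the absolute score difference is a few points.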

You can use CLASSLA-Stanza as a Python library (documentation is available here) or via an online service (currently available for Slovenian, other languages and modules coming soon). Separate models are also freely available at the CLARIN.SI repository.
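As a rough sketch of the library route, the Python snippet below wraps the pipeline in a small function and reads the result as CoNLL-U. It assumes that the `classla` package is installed (`pip install classla`) and that models can be downloaded. The calls `classla.download`, `classla.Pipeline` and `to_conll` follow the library's documented usage as we understand it, so please check the documentation for your version. The helper names `annotate` and `conllu_rows`, as well as the sample annotation, are hypothetical and written by hand for illustration.

```python
def conllu_rows(conllu: str):
    """Yield (form, lemma, upos) triples from CoNLL-U text."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges such as "1-2"
        yield cols[1], cols[2], cols[3]


def annotate(text: str, lang: str = "sl", kind: str = "standard") -> str:
    """Run CLASSLA-Stanza on text and return CoNLL-U (assumed API, see lead-in)."""
    import classla  # imported lazily so the helper above works without it

    classla.download(lang, type=kind)         # fetch models for the language
    nlp = classla.Pipeline(lang, type=kind)   # kind could also be "nonstandard" or "web"
    return nlp(text).to_conll()


# Hand-written CoNLL-U sample (not actual tool output), used to show the helper.
SAMPLE = (
    "# text = Dober dan.\n"
    "1\tDober\tdober\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
for form, lemma, upos in conllu_rows(SAMPLE):
    print(form, lemma, upos)

# With classla installed, the same helper can read real output:
#   rows = list(conllu_rows(annotate("France Prešeren je rojen v Vrbi.")))
```

Keeping the CoNLL-U reader separate from the pipeline call means the same code can also process files annotated through the online service.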

These results would not be possible without immense efforts in developing high-quality training datasets together with our collaborators all around Europe. We wish to use this opportunity to most warmly thank all of them!

June 23, 2023 – CLASSLA web corpora of Croatian, Serbian and Slovenian

We are delighted to announce the release of the pilot versions (v0.1) of the CLASSLA web corpora for Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian (1.9 billion words). The main features of the newly released corpora, aside from their massive size and recency (crawled in 2022), are their automatic enrichment with genre information and their linguistic processing with the improved CLASSLA-Stanza annotation pipeline (applied version to be released soon). The corpora are available for search via the CLARIN.SI concordancers, Crystal NoSketchEngine, Bonito NoSketchEngine and KonText. The pilot versions of these corpora are intended to gather valuable user feedback, while the official release (v1.0) of the three existing corpora, along with web corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is scheduled for later this year.

We warmly welcome you to explore our corpora. Please reach out to us at helpdesk.classla@clarin.si with any ideas for improvements; we will try hard to implement them already in the upcoming official release! We also encourage you to share with us how you plan to use these corpora in your research, as well as any other use cases you may have in mind.

To give you some ideas on how the corpora can be used in your research, we invite you to read our blog post on the use of the CLASSLA web corpora via the open CLARIN.SI concordancers. The step-by-step tutorial covers a wide range of functionalities of the concordancers, including finding collocations in different genres, analyzing word statistics, and exploring the use of non-standard words. This resource is particularly suited for linguists, language teachers and digital humanists.

April 25, 2023 – A web corpus intermezzo

We have been rather silent for some time now due to many developments that required our full capacity. But you can expect reports on interesting resources, tools, and experiments in the following months!

We were, however, not the only ones who were very busy in the previous period. Philipp Wasserscheidt has recently published the PDRS web corpus of the Serbian language, 715 million tokens in size. You can find more details on the corpus in the CLARIN.SI repository entry where the corpus is available for download. The corpus is also available via the CLARIN.SI concordancers.

Philipp is also making sure that future users know how to use the corpus. This is slightly last-minute, but maybe still not too late for some of you – a workshop on the PDRS web corpus usage will be held from this Thursday to Saturday in Belgrade. More information is available at https://javnidiskurs.rs/poziv-na-radionicu-pdrs-1-0/.

Since we are on the topic of web corpora, we have two pieces of news to share right away as well:

1. The head of the CLASSLA centre, Nikola Ljubešić, has taken one of the leading roles in the ACL Special Interest Group for Web as a Corpus (SIGWAC). If you are interested in this area of research, you should join the SIG by signing up to the mailing list.

2. We are in the process of releasing the MaCoCu datasets, which are web crawls of various national top-level domains, including those of Slovenia, Croatia, Bosnia and Herzegovina, Montenegro, Serbia, Macedonia and Bulgaria. We are sharing here the link just to the Macedonian dataset. Linguistic processing of the datasets has just started, and will result in the CLASSLA web corpora, to be updated on a biyearly basis.

December 22, 2022 – Looking forward to 2023!

We wanted to wish all of you happy holidays and a successful 2023. To wrap up a very busy, but also a very successful 2022, we are sharing with you what we will be releasing in the first half of the next year.

We are working on releasing a new version of our CLASSLA-Stanza tool, with the following improvements:

  • Minor improvements on usability and programming interface
  • New Slovenian models for standard and Internet non-standard language, but also for spoken language (transcripts), most of the improvements being the results of the VERY successful RSDO project
  • New standard and non-standard models for Croatian and Serbian, as we are constantly working on improving our data (it is a never-ending game)
  • Drastically improved standard model for Macedonian (we resolved numerous errors by extending the training data; the previous model was trained only on a novel by Orwell)
  • We will also release the tool through a web interface and a web service, similar to the RSDO interface for linguistic processing of Slovenian (which also uses CLASSLA-Stanza)

Inside the ParlaMint project we are working hard on releasing parliamentary corpora for the Slovenian, Croatian, Bosnian, Serbian and Bulgarian parliaments, which is one of the big coordination successes of the CLASSLA K-centre. Just for comparison, in the first iteration of the ParlaMint corpus, there was only one term of the Croatian parliament covered, while now the corpus will cover six terms. Bosnian and Serbian were not part of the first iteration of the ParlaMint corpus.

Finally, we will also publish our new generation of web corpora, called CLASSLA-web. We have already prepared the raw data for Slovenian (1.8 billion words), Croatian (2.3 billion words), Macedonian (524 million words), and Bulgarian (3.5 billion words), but will release the corpora both for download and through concordancers once we have all the languages fully processed (we are currently processing Bosnian, Montenegrin and Serbian) and the data annotated with the latest version of CLASSLA-Stanza.

October 19, 2022 – Our recent activities on speech

We wanted to share with you our recent results on speech processing, something we mentioned will be one of our foci in 2022.

We released two speech datasets. One is in Croatian, the ParlaSpeech-HR dataset, 1816 hours of recordings in size, with accompanying transcriptions and speaker metadata. The dataset is based on the ParlaMint corpus of Croatian parliamentary proceedings. The other dataset is in Serbian, the JuzneVesti-SR dataset, “only” 50 hours in size. It consists of audio recordings and transcripts from the Južne Vesti website and its host show called 15 minuta, with speaker metadata available as well. With each of the datasets, we also released automatic speech recognition (ASR) models on HuggingFace: four Croatian ASR models for the ParlaSpeech-HR dataset, with an excellent (but in-domain) word error rate of only 4%, and, for now, one Serbian ASR model for the JuzneVesti-SR dataset. You are more than welcome to take any of the models or data (all are available under CC-BY-SA). Interestingly, our speech-related efforts were very quickly picked up by the industry as well, featuring our speech and text technologies in a recent blog.
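For those new to ASR evaluation, word error rate (WER) is commonly defined as the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The snippet below is a minimal sketch of that definition in Python. It is not the evaluation code used for the models above, and the function name and example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one of five reference words is missing in the hypothesis.
wer = word_error_rate("the session is now open", "the session is open")
print(f"{wer:.0%}")  # prints 20%
```

A WER of 4% therefore corresponds to roughly one word error per 25 reference words, measured on data from the same domain as the training data.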

We also published two papers, one on the overall approach to building the ParlaSpeech-HR dataset, another on performing benchmarking for user profiling over the ParlaSpeech-HR dataset.

Given the recent successes in acquiring funding for performing more research on spoken data, in the following years we will be researching many super-interesting speech-related tasks, including:

  • word-level clustering of types of pronunciation and extraction of prototypical pronunciations
  • linguistic processing of transcripts of spoken data, potentially informed by the speech signal itself
  • disfluency identification and classification
  • dialogue act classification
  • identifying ways to build large and cheap spoken corpora of South Slavic languages

Please do get in touch if you are interested, or already working on speech. Also, we invite similar e-mails – drafting future activities – from other sides as well! We need coordination between different efforts, something we discussed at great length in our recently published book chapter.

May 6, 2022 – Massive monolingual and parallel South Slavic corpora freely available

We are happy to announce that new high-quality monolingual and parallel web corpora for South Slavic languages have been released. The corpora were created within the scope of the MaCoCu project, which focuses on collecting monolingual and parallel data from the Internet for European under-resourced languages, South Slavic languages included.

The datasets were built by crawling the national top-level domains, extending the crawl dynamically to other domains as well. More information on the corpora construction and links to the freely-available tools that were used for crawling and cleaning can be found in the description of resources, published on the CLARIN.SI repository (see links below).

The following new South Slavic corpora are freely available from the CLARIN.SI repository:

We are already working on using the above datasets for BERT-like language model pre-training, and producing linguistically-annotated corpora that will be available through our concordancers. Next year, the corpora will be upgraded and additional South Slavic monolingual and parallel corpora will be released, i.e., Bosnian, Serbian and Montenegrin.

April 20, 2022 – First open speech-to-text system and ASR training dataset for Croatian

The first open speech-to-text system for Croatian is now available in the Hugging Face model hub. The system is currently based on 72 hours of transcripts of parliamentary debates from the Croatian parliament. The ASR training dataset for Croatian ParlaSpeech-HR v1.0 is freely available in the CLARIN.SI repository. The dataset and the system were developed by Nikola Ljubešić, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, Danijel Korzinek and Peter Rupnik. These results would not have been possible without a wider collaboration around the ParlaMint project, and for that Darja Fišer, Tomaž Erjavec, Maciej Ogrodniczuk and Petya Osenova are to be thanked.

December 21, 2021 – CLASSLA in Tour de CLARIN

CLASSLA has been presented by Tour de CLARIN, a CLARIN ERIC initiative which presents its national consortiums, Knowledge Centres and Service Providing Centres (B-centres). Find out more about CLASSLA’s activities, services and its mission here. The new volume of Tour de CLARIN also includes an interview with Zrinka Kolaković in which she shares how she uses our corpora and tools to research South Slavic clitics and aspect. Read more here.

December 13, 2021 – Workshop on regional markedness in text

On 6 and 7 November 2021, an online workshop on regional markedness in text took place, organised by the ReLDI centre, University of Zurich, and CLASSLA. The materials from the workshop are now available here. They provide a gentle introduction to querying the corpora through the noSketchEngine and KonText concordancers, using the Corpus Query Language (CQL) syntax and morphosyntactic descriptions (MSDs) to analyse gender bias in society.

November 26, 2021 – Success Stories

A new entry has been added to the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). In Success stories, we present activities where collaboration resulted in important language resources for Slovenian, Croatian and Serbian, created with a fraction of the full costs by exploiting the large synergistic potential of South Slavic languages. These are the stories that motivated the creation of CLASSLA.