The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.
The helpdesk of CLASSLA can be contacted via email@example.com. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.
The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use the CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.
CLASSLA Blog Posts
- Comparable CLASSLA web corpora of South Slavic languages (December 5, 2023; 3-minute read)
- CLASSLA-web: Bigger and Better Web Corpora for Croatian, Serbian and Slovenian on CLARIN.SI Concordancers (June 22, 2023; 10-minute read)
The most relevant announcements, discussed in our mailing list, are made available below. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.
September 13, 2023 – New state-of-the-art version of CLASSLA-Stanza pipeline for linguistic processing of South Slavic languages
We are delighted to announce the release of an improved CLASSLA-Stanza pipeline, which enables state-of-the-art linguistic processing of Slovenian, Croatian, Serbian, Macedonian and Bulgarian.
In addition to covering standard varieties of five South Slavic languages, the pipeline also provides special modules for linguistic annotation of non-standard text and web corpora for Slovenian, Croatian and Serbian. The CLASSLA-Stanza annotation tool supports a total of six tasks: tokenization, morphosyntactic annotation, lemmatization, dependency parsing, semantic role labeling, and named-entity recognition. Some of the main improvements that separate CLASSLA-Stanza from the Stanza pipeline are:
- support for external inflectional lexicons, which significantly increases performance on morphologically rich languages;
- extended training datasets (beyond Universal Dependencies data) for all included models;
- use of CLARIN.SI-embed word embeddings, trained on significantly larger and more diverse datasets than embeddings used by Stanza;
- specific modules for standard, non-standard and web text.
As a result, we are happy to report that CLASSLA-Stanza significantly outperforms Stanza, with error reduction between 34% and 98% on the official Slovenian benchmark (see the table below, which reports performance as micro F1 scores). You can find more details on the pipeline improvements and training settings in the technical report “CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages” (Terčon & Ljubešić, 2023).
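To make the error-reduction figures more concrete, here is a minimal sketch of one common way to compute relative error reduction from two F1 scores. It assumes that the error is taken as 100 minus the F1 score (in percent). The function name and the sample scores are illustrative only and are not taken from the benchmark.

```python
def relative_error_reduction(baseline_f1: float, improved_f1: float) -> float:
    """Return the relative error reduction in percent.

    Assumption: error is defined as 100 - F1, with scores given in percent.
    """
    baseline_error = 100.0 - baseline_f1
    improved_error = 100.0 - improved_f1
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Hypothetical scores, not benchmark results: going from 90.0 to 95.0
# halves the remaining error.
print(relative_error_reduction(90.0, 95.0))  # 50.0
```

Under this definition, a small absolute gain on an already strong baseline can still correspond to a large relative error reduction.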
You can use CLASSLA-Stanza as a Python library (documentation is available here) or via an online service (currently available for Slovenian; other languages and modules are coming soon). Separate models are also freely available at the CLARIN.SI repository.
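As a rough illustration of the library route, the sketch below wraps the basic download-and-annotate calls of the `classla` package in a small helper. The table of supported language and text-type combinations simply mirrors this announcement and is an assumption; the helper names `is_supported` and `annotate` are hypothetical. Please check the CLASSLA-Stanza documentation for the models that are actually available.

```python
# Language codes and text types as described in this announcement (assumed,
# not read from the library): non-standard and web modules are listed only
# for Slovenian, Croatian and Serbian.
SUPPORTED_TYPES = {
    "sl": {"standard", "nonstandard", "web"},
    "hr": {"standard", "nonstandard", "web"},
    "sr": {"standard", "nonstandard", "web"},
    "mk": {"standard"},
    "bg": {"standard"},
}


def is_supported(lang: str, text_type: str = "standard") -> bool:
    """Check a language/text-type combination against the table above."""
    return text_type in SUPPORTED_TYPES.get(lang, set())


def annotate(text: str, lang: str = "sl", text_type: str = "standard") -> str:
    """Annotate text with CLASSLA-Stanza and return CoNLL-U output."""
    if not is_supported(lang, text_type):
        raise ValueError(f"unsupported combination: {lang!r}, {text_type!r}")
    # Imported lazily so that the check above works without the package.
    import classla

    classla.download(lang, type=text_type)        # fetches models on first use
    nlp = classla.Pipeline(lang, type=text_type)  # builds the processing pipeline
    return nlp(text).to_conll()


# Example call (downloads models, so it is left commented out here):
# print(annotate("France Prešeren je rojen v Vrbi."))
```

Installing the package (for example with `pip install classla`) and an internet connection for the model download are needed before the example call will run.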
These results would not have been possible without the immense effort that went into developing high-quality training datasets together with our collaborators all around Europe. We would like to take this opportunity to thank all of them most warmly!
June 23, 2023 – CLASSLA web corpora of Croatian, Serbian and Slovenian
We are delighted to announce the release of the pilot versions (v0.1) of the CLASSLA web corpora for Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian (1.9 billion words). The main features of the newly released corpora, aside from their massive size and recency (crawled in 2022), are their automatic enrichment with genre information and their linguistic processing with the improved CLASSLA-Stanza annotation pipeline (applied version to be released soon). The corpora are available for search via the CLARIN.SI concordancers, Crystal NoSketchEngine, Bonito NoSketchEngine and KonText. The pilot versions of these corpora are intended to gather valuable user feedback, while the official release (v1.0) of the three existing corpora, along with web corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is scheduled for later this year.
We warmly welcome you to explore our corpora. Please reach out to us at firstname.lastname@example.org with any ideas for improvements; we will do our best to implement them in the upcoming official release! We also encourage you to share with us how you plan to use these corpora in your research, as well as any other use cases you may have in mind.
To give you some ideas on how the corpora can be used in your research, you are invited to read our blog post on the use of CLASSLA web corpora via the open CLARIN.SI concordancers. The step-by-step tutorial covers a wide range of functionalities of the concordancers, including finding collocations in different genres, analyzing word statistics, and exploring the use of non-standard words. This resource is particularly suited for linguists, language teachers and digital humanists.
April 25, 2023 – A web corpus intermezzo
We have been rather quiet for some time now due to many developments that required our full capacity. But you can expect reports on interesting resources, tools, and experiments in the coming months!
We were, however, not the only ones who were very busy in the previous period. Philipp Wasserscheidt has recently published the PDRS web corpus of the Serbian language, 715 million tokens in size. You can find more details on the corpus in the CLARIN.SI repository entry, where the corpus is available for download. The corpus is also available via the CLARIN.SI concordancers.
Philipp is also making sure that future users know how to use the corpus. This is slightly last-minute, but maybe still not too late for some of you – a workshop on the PDRS web corpus usage will be held from this Thursday to Saturday in Belgrade. More information is available at https://javnidiskurs.rs/poziv-na-radionicu-pdrs-1-0/.
Since we are on the topic of web corpora, we have two pieces of news to share right away as well:
1. The head of the CLASSLA centre, Nikola Ljubešić, has taken one of the leading roles in the ACL Special Interest Group for Web as a Corpus (SIGWAC). If you are interested in this area of research, you should join the SIG by signing up to the mailing list.
2. We are in the process of releasing the MaCoCu datasets, which are web crawls of various national top-level domains, including those of Slovenia, Croatia, Bosnia and Herzegovina, Montenegro, Serbia, Macedonia and Bulgaria. For now, we are sharing the link to the Macedonian dataset only. Linguistic processing of the datasets has just started and will result in the CLASSLA web corpora, to be updated on a biyearly basis.
December 22, 2022 – Looking forward to 2023!
We wanted to wish all of you happy holidays and a successful 2023. To wrap up a very busy, but also very successful 2022, we are sharing with you what we will be releasing in the first half of next year.
We are working on releasing a new version of our CLASSLA-Stanza tool, with the following improvements:
- Minor improvements on usability and programming interface
- New Slovenian models for standard and Internet non-standard language, but also for spoken language (transcripts), most of the improvements being the results of the VERY successful RSDO project
- New standard and non-standard models for Croatian and Serbian, as we are constantly working on improving our data (it is a never-ending game)
- Drastically improved standard model for Macedonian (we resolved numerous errors by extending the training data; the previous model was trained only on a novel by Orwell)
- We will also release the tool through a web interface and a web service, similar to the RSDO interface for linguistic processing of Slovenian (which also uses CLASSLA-Stanza)
Inside the ParlaMint project we are working hard on releasing parliamentary corpora for the Slovenian, Croatian, Bosnian, Serbian and Bulgarian parliaments, which is one of the big coordination successes of the CLASSLA K-centre. Just for comparison, in the first iteration of the ParlaMint corpus, there was only one term of the Croatian parliament covered, while now the corpus will cover six terms. Bosnian and Serbian were not part of the first iteration of the ParlaMint corpus.
Finally, we will also publish our new generation of web corpora, called CLASSLA-web. We have already prepared the raw data for Slovenian (1.8 billion words), Croatian (2.3 billion words), Macedonian (524 million words), and Bulgarian (3.5 billion words), but we will release the corpora, both for download and through concordancers, once all the languages are fully processed (we are currently processing Bosnian, Montenegrin and Serbian) and the data are annotated with the latest version of CLASSLA-Stanza.
October 19, 2022 – Our recent activities on speech
We wanted to share with you our recent results on speech processing, which we announced would be one of our foci in 2022.
We released two speech datasets. One is in Croatian, the ParlaSpeech-HR dataset, 1816 hours of recordings in size, with accompanying transcriptions and speaker metadata. The dataset is based on the ParlaMint corpus of Croatian parliamentary proceedings. The other dataset is in Serbian, the JuzneVesti-SR dataset, “only” 50 hours in size. It consists of audio recordings and transcripts from the Južne Vesti website and its host show called 15 minuta, with speaker metadata available as well. With each of the datasets, we also released automatic speech recognition (ASR) models on HuggingFace: four Croatian ASR models for the ParlaSpeech-HR dataset, with an excellent (but in-domain) word error rate of only 4%, and, for now, one Serbian ASR model for the JuzneVesti-SR dataset. You are more than welcome to take any of the models or data (all are available under CC-BY-SA). Interestingly, our speech-related efforts were very quickly picked up by industry as well, with our speech and text technologies featured in a recent blog.
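For readers less familiar with the metric, word error rate is commonly defined as the number of word substitutions, deletions and insertions needed to turn the recognised text into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance. It assumes plain whitespace tokenisation and no text normalisation, so it is not necessarily how the 4% figure above was obtained; the function name and the example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
print(word_error_rate("hvala lijepa na pitanju", "hvala lijepo na pitanju"))  # 0.25
```

Because the denominator is the reference length, the value can exceed 1.0 when the hypothesis contains many inserted words.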
We also published two papers, one on the overall approach to building the ParlaSpeech-HR dataset, another on performing benchmarking for user profiling over the ParlaSpeech-HR dataset.
Given the recent successes in acquiring funding for performing more research on spoken data, in the following years we will be researching many super-interesting speech-related tasks, including:
- word-level clustering of types of pronunciation and extraction of prototypical pronunciations
- linguistic processing of transcripts of spoken data, potentially informed by the speech signal itself
- disfluency identification and classification
- dialogue act classification
- identifying ways to build large and cheap spoken corpora of South Slavic languages
Please do get in touch if you are interested in, or already working on, speech. We also welcome similar e-mails outlining planned activities from other groups! We need coordination between different efforts, something we discussed at great length in our recently published book chapter.
May 6, 2022 – Massive monolingual and parallel South Slavic corpora freely available
We are happy to announce that new high-quality monolingual and parallel web corpora for South Slavic languages have been released. The corpora were created within the scope of the MaCoCu project, which focuses on collecting monolingual and parallel data from the Internet for European under-resourced languages, South Slavic languages included.
The datasets were built by crawling the national top-level domains, extending the crawl dynamically to other domains as well. More information on the corpora construction and links to the freely-available tools that were used for crawling and cleaning can be found in the description of resources, published on the CLARIN.SI repository (see links below).
The following new South Slavic corpora are freely available from the CLARIN.SI repository:
- Croatian web corpus MaCoCu-hr 1.0 with 2.3 billion words in 7 million texts;
- Slovene web corpus MaCoCu-sl 1.0 with 1.8 billion words in 5.8 million texts;
- Macedonian web corpus MaCoCu-mk 1.0 with 0.5 billion words in 1.96 million texts;
- Bulgarian web corpus MaCoCu-bg 1.0 with 3.5 billion words in 10.5 million texts;
- Croatian-English parallel corpus MaCoCu-hr-en 1.0 with 135 million words in 3 million segments (sentence pairs);
- Slovene-English parallel corpus MaCoCu-sl-en 1.0 with 137 million words in 3 million segments;
- Macedonian-English parallel corpus MaCoCu-mk-en 1.0 with 24 million words in 0.48 million segments;
- Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 with 159 million words in 3.9 million segments.
We are already working on using the above datasets for BERT-like language model pre-training and on producing linguistically annotated corpora that will be available through our concordancers. Next year, the corpora will be upgraded and additional South Slavic monolingual and parallel corpora will be released, namely for Bosnian, Serbian and Montenegrin.
April 20, 2022 – First open speech-to-text system and ASR training dataset for Croatian
The first open speech-to-text system for Croatian is now available in the Hugging Face model hub. The system is currently based on 72 hours of transcripts of parliamentary debates from the Croatian parliament. The ASR training dataset for Croatian ParlaSpeech-HR v1.0 is freely available in the CLARIN.SI repository. The dataset and the system were developed by Nikola Ljubešić, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, Danijel Korzinek and Peter Rupnik. These results would not have been possible without a wider collaboration around the ParlaMint project, and for that Darja Fišer, Tomaž Erjavec, Maciej Ogrodniczuk and Petya Osenova are to be thanked.
December 21, 2021 – CLASSLA in Tour de CLARIN
CLASSLA has been presented by Tour de CLARIN, a CLARIN ERIC initiative which presents its national consortiums, Knowledge Centres and Service Providing Centres (B-centres). Find out more about CLASSLA’s activities, services and its mission here. The new volume of Tour de CLARIN also includes an interview with Zrinka Kolaković in which she shares how she uses our corpora and tools to research South Slavic clitics and aspect. Read more here.
December 13, 2021 – Workshop on regional markedness in text
On 6 and 7 November 2021, an online workshop on regional markedness in text took place, organised by the ReLDI centre, University of Zurich, and CLASSLA. The materials from the workshop are now available here. They provide a gentle introduction to querying the corpora through the noSketchEngine and KonText concordancers, using the Corpus Query Language (CQL) syntax and morphosyntactic descriptions (MSDs) to analyse gender bias in society.
November 26, 2021 – Success Stories
A new entry has been added to the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). In Success stories, we present activities where collaboration resulted in important language resources for Slovenian, Croatian and Serbian, created with a fraction of the full costs by exploiting the large synergistic potential of South Slavic languages. These are the stories that motivated the creation of CLASSLA.