The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.
The helpdesk of CLASSLA can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.
The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use the CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.
CLASSLA is operated by CLARIN.SI, the Institute of Croatian Language and Linguistics, and CLADA-BG.
Recent Announcements
The most relevant announcements, discussed in our mailing list, are made available below. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.
May 6, 2022 – Massive monolingual and parallel South Slavic corpora freely available
We are happy to announce that new high-quality monolingual and parallel web corpora for South Slavic languages have been released. The corpora were created within the scope of the MaCoCu project, which focuses on collecting monolingual and parallel data from the Internet for under-resourced European languages, including South Slavic languages.
The datasets were built by crawling the national top-level domains and dynamically extending the crawl to other domains. More information on the construction of the corpora, along with links to the freely available tools used for crawling and cleaning, can be found in the resource descriptions published in the CLARIN.SI repository (see links below).
The following new South Slavic corpora are freely available from the CLARIN.SI repository:
- Croatian web corpus MaCoCu-hr 1.0 with 2.3 billion words in 7 million texts;
- Slovene web corpus MaCoCu-sl 1.0 with 1.8 billion words in 5.8 million texts;
- Macedonian web corpus MaCoCu-mk 1.0 with 0.5 billion words in 1.96 million texts;
- Bulgarian web corpus MaCoCu-bg 1.0 with 3.5 billion words in 10.5 million texts;
- Croatian-English parallel corpus MaCoCu-hr-en 1.0 with 135 million words in 3 million segments (sentence pairs);
- Slovene-English parallel corpus MaCoCu-sl-en 1.0 with 137 million words in 3 million segments;
- Macedonian-English parallel corpus MaCoCu-mk-en 1.0 with 24 million words in 0.48 million segments;
- Bulgarian-English parallel corpus MaCoCu-bg-en 1.0 with 159 million words in 3.9 million segments.
We are already working on pre-training BERT-like language models on the above datasets and on producing linguistically annotated corpora that will be available through our concordancers. Next year, the corpora will be upgraded and additional South Slavic monolingual and parallel corpora will be released, namely for Bosnian, Serbian and Montenegrin.
April 20, 2022 – First open speech-to-text system and ASR training dataset for Croatian
The first open speech-to-text system for Croatian is now available on the Hugging Face model hub. The system is currently based on 72 hours of transcripts of parliamentary debates from the Croatian parliament. The ASR training dataset for Croatian, ParlaSpeech-HR v1.0, is freely available in the CLARIN.SI repository. The dataset and the system were developed by Nikola Ljubešić, Ivo-Pavao Jazbec, Vuk Batanović, Lenka Bajčetić, Danijel Korzinek and Peter Rupnik. These results would not have been possible without the wider collaboration around the ParlaMint project, for which we thank Darja Fišer, Tomaž Erjavec, Maciej Ogrodniczuk and Petya Osenova.
December 21, 2021 – CLASSLA in Tour de CLARIN
CLASSLA has been featured in Tour de CLARIN, a CLARIN ERIC initiative that presents its national consortia, Knowledge Centres and Service Providing Centres (B-centres). Find out more about CLASSLA’s activities, services and mission here. The new volume of Tour de CLARIN also includes an interview with Zrinka Kolaković, in which she shares how she uses our corpora and tools to research South Slavic clitics and aspect. Read more here.
December 13, 2021 – Workshop on regional markedness in text
On 6 and 7 November 2021, an online workshop on regional markedness in text took place, organised by the ReLDI centre, University of Zurich, and CLASSLA. The materials from the workshop are now available here. They provide a gentle introduction to querying the corpora through the noSketchEngine and KonText concordancers, using the Corpus Query Language (CQL) syntax and morphosyntactic descriptions (MSDs) to analyse gender bias in society.
November 26, 2021 – Success Stories
A new entry has been added to the CLARIN Knowledge Centre for South Slavic languages (CLASSLA). In Success stories, we present activities where collaboration resulted in important language resources for Slovenian, Croatian and Serbian, created at a fraction of the full cost by exploiting the large synergistic potential of South Slavic languages. These are the stories that motivated the creation of CLASSLA.