We are pleased to announce that pilot versions (v0.1) of the CLASSLA-web corpora are now available within the CLASSLA Knowledge Center. The corpora include Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian (1.9 billion words).
In addition to the new corpora, a tutorial on using the CLASSLA-web corpora through the CLARIN.SI concordancers has been published.
You can read more about the latest news from the CLASSLA Knowledge Center below.
CLASSLA web corpora of Croatian, Serbian and Slovenian
We are delighted to announce the release of the pilot versions (v0.1) of the CLASSLA web corpora for Croatian (2.3 billion words), Serbian (2.4 billion words) and Slovenian (1.9 billion words). The main features of the newly released corpora, aside from their massive size and recency (crawled in 2022), are their automatic enrichment with genre information and their linguistic processing with the improved CLASSLA-Stanza annotation pipeline (the applied version is to be released soon). The corpora are available for search via the CLARIN.SI concordancers: Crystal NoSketchEngine, Bonito NoSketchEngine and KonText. The pilot versions of these corpora are intended to gather valuable user feedback, while the official release (v1.0) of the three existing corpora, along with web corpora for Bosnian, Montenegrin, Macedonian, and Bulgarian, is scheduled for later this year.
We warmly welcome you to explore our corpora. Please reach out to us at email@example.com with any ideas for improvements; we will do our best to implement them in the upcoming official release. We also encourage you to share with us how you plan to use these corpora in your research, as well as any other use cases you may have in mind.
To give you some ideas on how the corpora can be used in your research, we invite you to read our blog post on the use of CLASSLA web corpora via the open CLARIN.SI concordancers. The step-by-step tutorial covers a wide range of concordancer functionalities, including finding collocations in different genres, analyzing word statistics, and exploring the use of non-standard words. This resource is particularly suited for linguists, language teachers and digital humanists.