CLASSLA-Express: Workshops on using CLARIN.SI spoken and web corpora in the analysis of language variation

The CLASSLA-Express series of workshops is entering its third iteration (the 2026 edition). It trains participants to use the CLASSLA-web corpora and the ParlaSpeech spoken corpora in research on language variation. The workshops comprise hands-on exercises showing how to query and analyse textual and speech data in various South Slavic languages.

Registration

The registration for the pre-conference event at HDPL 2026 in Pula, Croatia, is now open! The registration for the pre-conference events at JTDH 2026 in Ljubljana, Slovenia, and at JuDig 2026 in Belgrade, Serbia, will follow soon.

Description

The workshops offer hands-on work with spoken and web corpora. In the first part, participants will explore the ParlaSpeech parliamentary speech corpus using the NoSketchEngine concordancer, analyse concordances, and extract variables and speech features with the help of Praat.

To get a first look at the ParlaSpeech corpus, you can watch the following short video guides:

In the second part, participants will work with CLASSLA-web corpora, build subcorpora using the NoSketchEngineLog concordancer, and compare lexical patterns across different genres and domains.
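
As a rough illustration of what comparing lexical patterns across genres can involve, the Python sketch below normalises a word's raw count to occurrences per million tokens in each subcorpus. The toy sentences, the genre labels and the function names are invented for illustration; they are not taken from the CLASSLA-web corpora or from the workshop materials.

```python
from collections import Counter


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per million tokens."""
    if corpus_size == 0:
        return 0.0  # an empty subcorpus has no meaningful frequency
    return count * 1_000_000 / corpus_size


def compare_genres(tokens_by_genre: dict[str, list[str]], word: str) -> dict[str, float]:
    """Return the relative frequency of `word` in each genre subcorpus."""
    result = {}
    for genre, tokens in tokens_by_genre.items():
        # Case-insensitive counting of word forms (no lemmatisation here).
        counts = Counter(t.lower() for t in tokens)
        result[genre] = per_million(counts[word.lower()], len(tokens))
    return result


# Toy data, invented for illustration only.
toy = {
    "news": "vlada je danas objavila novi zakon i vlada ga podržava".split(),
    "forum": "ne znam što vlada misli ali meni je to čudno danas".split(),
}
freqs = compare_genres(toy, "vlada")
for genre, value in freqs.items():
    print(f"{genre}: {value:.0f} per million tokens")
```

Normalising by subcorpus size matters because genre subcorpora usually differ in size, so raw counts alone are not comparable.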

To get familiar with CLASSLA-web corpora in advance, we recommend the following video guide:

Introducing CLASSLA-web

Participants will learn how to:

formulate corpus queries
analyse concordances and metadata
extract and interpret linguistic variables from concordance data
build subcorpora
compare lexical patterns
explore features of spoken language

Who is it for? The workshops are intended for students, linguists, phoneticians, lexicographers, language teachers, digital humanities researchers, and anyone interested in corpus-based language research.

The first two iterations of the workshops were based on using the CLARIN.SI NoSketch Engine concordancer to obtain data on the meanings and uses of words, word forms, collocations and grammatical patterns. Moreover, the exercises enable participants to explore uses of words and lexico-grammatical constructions in different types of texts. Additionally, in the 2025 edition of the CLASSLA-Express workshops, CLASSLA-Express 2.0, the focus is on testing how large language models (LLMs) perform linguistic tasks, identifying which tasks can be entrusted to them, and determining the tasks for which the traditional corpus-based research methods and concordancers are more suitable.

The workshops are aimed at university students of South Slavic languages, linguists, lexicographers, language teachers and digital humanities scholars. They provide an opportunity for both beginners and more advanced corpus users to query corpora which reflect contemporary language use. Furthermore, the workshops introduce users and researchers of Macedonian to the first general corpus of that language – CLASSLA-web.mk. The practical skills which participants will acquire may be applied in language teaching (both L1 and L2), designing corpus-informed dictionaries and grammars, as well as in digital humanities.

The CLASSLA-Express series consists of two versions of workshops. In the CLASSLA-Express 1.0 version, we introduce participants to the established corpus-linguistic approach based on the use of corpora in concordancers. In 2025, we expanded the series with CLASSLA-Express 2.0, which incorporates large language models (LLMs) for corpus-linguistic tasks and compares their performance with traditional methods. The 2025 workshops focus on two key objectives: first, to continue the first edition's goal of sharing knowledge on the use of language tools; and second, to contribute to the development of a methodological framework for integrating AI tools into linguistic research. This framework combines AI tools with corpus-based methods and establishes criteria for evaluating their results. In doing so, we further explore the benefits and challenges of incorporating LLMs into corpus-linguistic research, aiming to create a methodologically sound approach for combining these two powerful toolsets.

On 16 October 2024, the CLASSLA-Express initiative was introduced to the CLARIN community in a paper presented at the CLARIN Annual Conference 2024 in Barcelona. The CLASSLA-Express paper and the slides from the presentation are published online.

Learn more about past workshops and view photos in the reports available here.

Past Workshops

Past workshops took place from April 2024 onward in 11 cities in six countries:

19 April 2024 – CLASSLA-Express 1.0 stop in Zagreb, Croatia (Faculty of Humanities and Social Sciences, University of Zagreb). Read the report on the event here.
26 April 2024 – CLASSLA-Express 1.0 stop in Rijeka, Croatia (Center for Language Research, Faculty of Humanities and Social Sciences, University of Rijeka). Read the report on the event here.
29 May 2024 – CLASSLA-Express 1.0 stop in Belgrade, Serbia (International conference Leksikografski susreti, Faculty of Philology, University of Belgrade). Read the report on the event here.
4 June 2024 – CLASSLA-Express 1.0 stop in Skopje, North Macedonia (Blaže Koneski Faculty of Philology, Ss. Cyril and Methodius University). Read the report on the event here.
26 June 2024 – CLASSLA-Express 1.0 stop in Sofia, Bulgaria (International CLaDA-BG Conference 2024).
5 September 2024 – a workshop by Jelena Parizoska and Ivana Filipović Petrović for the members of the PhraConRep COST Action in Osijek, Croatia (Faculty of Humanities and Social Sciences).
18 September 2024 – CLASSLA-Express 1.0 stop in Ljubljana, Slovenia (Language Technologies & Digital Humanities Conference 2024, University of Ljubljana). Read the report on the event here.
20 November 2024 – CLASSLA-Express 1.0 stop in Kragujevac, Serbia (Faculty of Philology and Arts, University of Kragujevac). Read the report on the event here.
9 January 2025 – a workshop by Petya Osenova in Sofia, Bulgaria (Institute for Bulgarian Language). Read the report on the event here.
20 February and 28 March 2025 – workshops facilitated by Jelena Parizoska with hands-on exercises based on the CLASSLA-Express 1.0 materials in Zagreb, Croatia (Faculty of Teacher Education and Faculty of Humanities and Social Sciences).
4 April 2025 – CLASSLA-Express 1.0 stop in Klagenfurt, Austria (University of Klagenfurt).
15 April 2025 – a workshop by Jelena Parizoska and Ivana Filipović Petrović for members of the PhraConRep COST Action in Zagreb, Croatia (Faculty of Humanities and Social Sciences).
11 June 2025 – CLASSLA-Express 2.0 stop in Zagreb, Croatia as part of HDPL 2025 (Faculty of Humanities and Social Sciences, University of Zagreb). Read the report on the event here.
10 October 2025 – CLASSLA-Express 1.0 Plus (corpora meets AI) stop in Graz, Austria (Institute for Slavic Studies, University of Graz). Read the report on the event here.
17 November 2025 – CLASSLA-Express 2.0 stop in Bled, Slovenia as part of the eLex 2025 conference (Rikli Balance Hotel). Read the report on the event here.

Reports from the workshops:

Past Programmes

Programme of CLASSLA-Express 1.0

1. Part I (90 minutes)

Introduction: The usage-based approach to language; the advantages of using corpora in language research (with examples from the CLASSLA-web corpora)
Exercises:
- Basic search: Concordances, Text types
- Advanced search: Filter, Collocations, Frequency

2. Coffee break (30 minutes)

3. Part II (90 minutes)

Exercises: CQL queries

4. Discussion and closing (30 minutes)
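
For readers unfamiliar with CQL (Corpus Query Language), the queries practised in Part II describe each token as a set of attribute-value conditions in square brackets. The Python sketch below only assembles such query strings; it does not contact any concordancer. The helper names are hypothetical, and the attribute names `lemma` and `tag`, as well as the assumption that adjective tags begin with `A`, should be checked against the documentation of the corpus being queried.

```python
def token(**attrs: str) -> str:
    """Build one CQL token expression, e.g. [lemma="jezik" & tag="N.*"].

    Values are inserted verbatim, so they must not contain double quotes.
    """
    if not attrs:
        return "[]"  # an empty pair of brackets matches any single token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def gap(minimum: int, maximum: int) -> str:
    """Build a gap of between `minimum` and `maximum` arbitrary tokens."""
    return f"[]{{{minimum},{maximum}}}"


def sequence(*parts: str) -> str:
    """Join token expressions into a query for consecutive tokens."""
    return " ".join(parts)


# An adjective directly followed by the lemma "jezik" (assumed tagset).
query = sequence(token(tag="A.*"), token(lemma="jezik"))
print(query)

# The same pair with up to two intervening tokens.
loose_query = sequence(token(tag="A.*"), gap(0, 2), token(lemma="jezik"))
print(loose_query)
```

The resulting strings can be pasted into the CQL field of a concordancer that supports this syntax.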

Programme of CLASSLA-Express 2.0

Part I (90 minutes)
- Introduction:
  - An Overview of Specific Enrichments within the CLASSLA-web Corpora
  - Large Language Models and Generative Artificial Intelligence – An Introduction
  - Examples of Using AI Tools in South Slavic Languages Research
Coffee Break (30 minutes)
Part II (approx. two to three hours): Hands-On Exercises – Corpora and AI-Driven Interfaces
- Exercises on extracting linguistic data, creating definitions, and providing usage examples of phraseological units for dictionaries
- Distinguishing literal and figurative uses of phraseological units
- Identifying the distinction between animate and inanimate entities
Discussion and closing (30 minutes)

The CLASSLA-Express Team

The team is made up of two linguists and three NLP scholars from three CLARIN member countries: Bulgaria (CLaDA-BG), Croatia (HR-CLARIN), and Slovenia (CLARIN.SI). The CLASSLA-Express workshops are supported by CLARIN.SI, CLaDA-BG, the Croatian Applied Linguistics Society (HDPL), and the LLM4DH project.

Dr Ivana Filipović Petrović is a Senior Research Associate at the Linguistic Research Institute of the Croatian Academy of Sciences and Arts, specializing in phraseology, historical and electronic lexicography, and the application of digital tools in lexicography. She is involved in several projects, coordinating the online Dictionary of Croatian Idioms project and contributing as a researcher to the Dictionary of the Croatian Literary Language (vol. S–Ž) and the Croatian Science Foundation Project Semantic-Syntactic Classification of Verbs in Croatian. Ivana is active in the COST Actions PhraConRep, where she leads workshops on using corpora and AI tools for researching phraseme constructions, and ENEOLI, where she coordinates Croatian equivalents in a multilingual neology thesaurus. In addition to her research, Ivana has co-organized and co-convened several workshops on using CLARIN.SI corpora in language research across South Slavic countries. Ivana was part of the local organizing committee for the Euralex 2024 Congress, co-chaired the organizing committee of the LinguaDOC conference for doctoral students, received the 2022 Best Project award at the Linguistic Linked Open Data Datathon, and is currently the Head of the Zagreb Branch of the Croatian Society for Applied Linguistics and a member of the editorial board of Studia Lexicographica.

Dr Jelena Parizoska is an Assistant Professor at the Faculty of Teacher Education, University of Zagreb. She holds a PhD in Linguistics from the University of Zagreb. Her research focuses on idiomatic expressions in English and Croatian within the theoretical framework of Cognitive Linguistics. She has taught university courses in figurative language, English syntax, lexicology and lexicography, and English for Specific Purposes. She is currently involved in two COST Actions: CA21167 Universality, diversity and idiosyncrasy in language technology (2022-2026) and CA22115 A Multilingual Repository of Phraseme Constructions in Central and Eastern European Languages (2023-2027).

Dr Petya Osenova is a professor with a PhD in Contemporary Bulgarian Grammar (morphology, syntax and corpus linguistics) in the Faculty of Slavic Studies at Sofia University “St. Kl. Ohridski” and a senior researcher in the Department of AI and Language Technologies at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. Her research interests include formal and computational linguistics, language resources, language modelling and the lexicon-grammar interface. She was a key person in a number of EU projects related to eLearning, Machine Translation, Language. She is responsible for the language resources in CLaDA-BG, the joint CLARIN and DARIAH framework in Bulgaria, and is the Bulgarian representative on the User Involvement Committee in CLARIN-ERIC. She specialized in computational linguistics as a postdoctoral fellow at Tuebingen University, Germany (2003), and at Groningen University, the Netherlands (2004), and as a Fulbrighter at Stanford University, USA (2010). In 2018 she received the Clarivate Analytics award for excellence in scientific research in South-Eastern Europe.

Dr Nikola Ljubešić and Taja Kuzman Pungeršek are researchers from the Department of Knowledge Technologies at the Jožef Stefan Institute, Slovenia. Their research interests cover a broad spectrum of natural language processing (NLP) tasks, including web corpus construction (cf. MaCoCu project), development of tools for automatic linguistic annotation for South Slavic languages (cf. CLASSLA-Stanza pipeline), specialization of language models for under-resourced languages (cf. BERTić and XLM-R-BERTić models), development of speech corpora and automatic speech recognition models (cf. ParlaSpeech corpora and MEZZANINE project), benchmarking South Slavic NLP technologies, application of NLP methods to South Slavic dialects (cf. VarDial DIALECT-COPA shared task), and machine learning tasks, including hate speech detection (cf. IMSyPP project), topic detection, and automatic genre identification. They are the leaders of the CLASSLA centre (CLARIN Knowledge Centre for South Slavic languages), which offers expertise on language resources and technologies for South Slavic languages. Nikola and Taja are the main developers behind the CLASSLA-web corpora, which are showcased in the CLASSLA-Express workshops.

About CLASSLA

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.

The helpdesk of CLASSLA can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.

The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.

The most relevant announcements, discussed in our mailing list, are made available here. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.

CLASSLA is operated by CLARIN.SI, the Institute of Croatian Language, and CLaDA-BG.