Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

CLASSLA-Express: Workshops on using CLARIN.SI corpora, resources and AI tools in language research

This series of workshops aims to show participants how to use the CLASSLA web corpora in language research. The workshops comprise hands-on exercises showing how to create queries in corpora for Bulgarian, Croatian, Macedonian, Serbian and Slovene.

The workshops are based on using the CLARIN.SI NoSketch Engine concordancer in order to obtain data on meanings and uses of words, word forms, collocations and grammatical patterns. Moreover, the exercises will enable participants to explore uses of words and lexico-grammatical constructions in different types of texts. Additionally, in the new edition of the CLASSLA-Express workshops, CLASSLA-Express 2.0, the focus will be on testing how large language models (LLMs) perform linguistic tasks, identifying which tasks can be entrusted to them, and determining the tasks for which the traditional corpus-based research methods and concordancers are more suitable.

The workshops are aimed at university students of South Slavic languages, linguists, lexicographers, language teachers and digital humanities scholars. They provide an opportunity for both beginners and more advanced corpus users to query corpora which reflect contemporary language use. Furthermore, the workshops will introduce users and researchers of Macedonian to the first general corpus of that language – CLASSLA-web.mk. The practical skills which participants will acquire may be applied in language teaching (both L1 and L2), designing corpus-informed dictionaries and grammars, as well as in digital humanities.

The CLASSLA-Express series consists of two versions of workshops. In the CLASSLA-Express 1.0 version, we introduce participants to the established corpus-linguistic approach based on the use of corpora in concordancers. In 2025, we expanded the series with CLASSLA-Express 2.0, which incorporates large language models (LLMs) for corpus-linguistic tasks and compares their performance with traditional methods. The workshops will focus on two key objectives: first, to continue last year’s goal of sharing knowledge on the usage of language tools; and second, to contribute to the development of a methodological framework for integrating AI tools into linguistic research. This framework will combine AI tools with corpus-based methods and establish criteria for evaluating their results. In doing so, we will further explore the benefits and challenges of incorporating LLMs into corpus linguistic research, aiming to create a methodologically sound approach for combining these two powerful toolsets.

On 16 October 2024, the CLASSLA-Express initiative was introduced to the CLARIN community in a paper presented at the CLARIN Annual Conference 2024 in Barcelona. The CLASSLA-Express paper and the slides from the presentation are published online.

Upcoming Workshops

The second edition of CLASSLA-Express workshops will take place from April to November 2025 in three countries:

Participation in the workshops is free.

In addition to the main CLASSLA-Express workshops, shorter sessions based on the CLASSLA-Express teaching materials are planned to take place at various venues:

Past Workshops

Past workshops took place from April 2024 onward in eight cities in five countries:

Reports from the workshops:

Programme

Programme of CLASSLA-Express 1.0

1. Part I (90 minutes)

  • Introduction: The usage-based approach to language; the advantages of using corpora in language research (with examples from the CLASSLA-web corpora)
  • Exercises:
    • Basic search: Concordances, Text types
    • Advanced search: Filter, Collocations, Frequency

2. Coffee break (30 minutes)

3. Part II (90 minutes)

  • Exercises: CQL queries

4. Discussion and closing (30 minutes)

Programme of CLASSLA-Express 2.0

  1. Part I (90 minutes)
    • Introduction:
      • An Overview of Specific Enrichments within the CLASSLA-web Corpora
      • Large Language Models and Generative Artificial Intelligence – An Introduction
      • Examples of Using AI Tools in South Slavic Languages Research
  2. Coffee Break (30 minutes)
  3. Part II (approx. two to three hours): Hands-On Exercises – Corpora and AI-Driven Interfaces
    • Exercises on extracting linguistic data, creating definitions, and providing usage examples of phraseological units for dictionaries
    • Distinguishing literal and figurative uses of phraseological units
    • Identifying the distinction between animate and inanimate entities
  4. Discussion and closing (30 minutes)

The CLASSLA-Express Team

The team is made up of two linguists and three NLP scholars from three CLARIN member countries: Bulgaria (CLaDA-BG), Croatia (HR-CLARIN), and Slovenia (CLARIN.SI). The CLASSLA-Express workshops are supported by CLARIN.SI, CLaDA-BG, the Croatian Applied Linguistics Society (HDPL), and the LLM4DH project.

Dr Ivana Filipović Petrović is a Senior Research Associate at the Linguistic Research Institute of the Croatian Academy of Sciences and Arts, specializing in phraseology, historical and electronic lexicography, and the application of digital tools in lexicography. She is involved in several projects, coordinating the online Dictionary of Croatian Idioms project and contributing as a researcher to the Dictionary of the Croatian Literary Language (vol. S–Ž) and the Croatian Science Foundation Project Semantic-Syntactic Classification of Verbs in Croatian. Ivana is active in the COST Actions PhraConRep, where she leads workshops on using corpora and AI tools for researching phraseme constructions, and ENEOLI, where she coordinates Croatian equivalents in a multilingual neology thesaurus. In addition to her research, Ivana has co-organized and co-convened several workshops on using CLARIN.SI corpora in language research across South Slavic countries. She was part of the local organizing committee for the Euralex 2024 Congress, co-chaired the organizing committee of the LinguaDOC conference for doctoral students, and received the 2022 Best Project award at the Linguistic Linked Open Data Datathon. She is currently the Head of the Zagreb Branch of the Croatian Society for Applied Linguistics and a member of the editorial board of Studia Lexicographica.

Dr Jelena Parizoska is an Assistant Professor at the Faculty of Teacher Education, University of Zagreb. She holds a PhD in Linguistics from the University of Zagreb. Her research focuses on idiomatic expressions in English and Croatian within the theoretical framework of Cognitive Linguistics. She has taught university courses in figurative language, English syntax, lexicology and lexicography, and English for Specific Purposes. She is currently involved in two COST Actions: CA21167 Universality, diversity and idiosyncrasy in language technology (2022-2026) and CA22115 A Multilingual Repository of Phraseme Constructions in Central and Eastern European Languages (2023-2027).

Dr Petya Osenova is a professor, with a PhD in Contemporary Bulgarian Grammar (morphology, syntax and corpus linguistics), in the Faculty of Slavic Studies at Sofia University “St. Kl. Ohridski” and a senior researcher in the Department of AI and Language Technologies at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. Her research interests include formal and computational linguistics, language resources, language modelling and the lexicon-grammar interface. She has played a key role in a number of EU projects related to eLearning, Machine Translation, Language. She is responsible for the language resources in CLaDA-BG, the joint CLARIN and DARIAH framework in Bulgaria, and is the Bulgarian representative on the User Involvement Committee of CLARIN-ERIC. She specialized in computational linguistics as a postdoctoral fellow at the University of Tübingen, Germany (2003), and the University of Groningen, the Netherlands (2004), and as a Fulbright scholar at Stanford University, USA (2010). In 2018 she received the Clarivate Analytics award for excellence in scientific research in South-Eastern Europe.

Dr Nikola Ljubešić and Taja Kuzman are researchers from the Department of Knowledge Technologies at the Jožef Stefan Institute, Slovenia. Their research interests cover a broad spectrum of natural language processing (NLP) tasks, including web corpus construction (cf. MaCoCu project), development of tools for automatic linguistic annotation for South Slavic languages (cf. CLASSLA-Stanza pipeline), specialization of language models for under-resourced languages (cf. BERTić and XLM-R-BERTić models), development of speech corpora and automatic speech recognition models (cf. ParlaSpeech corpora and MEZZANINE project), benchmarking South Slavic NLP technologies, application of NLP methods to South Slavic dialects (cf. VarDial DIALECT-COPA shared task), and machine learning tasks, including hate speech detection (cf. IMSyPP project), topic detection, and automatic genre identification. They are the leaders of the CLASSLA centre (CLARIN Knowledge Centre for South Slavic languages), which offers expertise on language resources and technologies for South Slavic languages. Nikola and Taja are the main developers behind the CLASSLA-web corpora, which are showcased in the CLASSLA-Express workshops.

About CLASSLA

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.

The helpdesk of CLASSLA can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.

The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use the CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.

The most relevant announcements, discussed in our mailing list, are made available here. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.

CLASSLA is operated by CLARIN.SI, the Institute of Croatian Language, and CLaDA-BG.