Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

CLASSLA-Express: Workshops on using CLARIN.SI corpora in language research

This series of five workshops aims to show participants how to use the CLASSLA web corpora in language research. The workshops comprise hands-on exercises showing how to create queries in corpora for Croatian, Macedonian, Serbian and Slovene.

The workshops use the CLARIN.SI NoSketch Engine concordancer to obtain data on the meanings and uses of words, word forms, collocations and grammatical patterns. The exercises will also enable participants to explore how words and lexico-grammatical constructions are used in different types of texts.

The five workshops are aimed at university students of South Slavic languages, linguists, lexicographers, language teachers and digital humanities scholars. They provide an opportunity for both beginners and more advanced corpus users to query corpora which reflect contemporary language use. Furthermore, the workshops will introduce users and researchers of Macedonian to the first general corpus of that language – CLASSLA-web.mk. The practical skills which participants will acquire may be applied in language teaching (both L1 and L2), designing corpus-informed dictionaries and grammars, as well as in digital humanities.

The workshops will take place from April to September in four countries:

Registration and details on the CLASSLA-Express stops

Participation in the workshops is free. More information about registration and the programme is available on the following webpages:

Agenda

1. Part I (90 minutes)

  • Introduction: The usage-based approach to language; the advantages of using corpora in language research (with examples from the CLASSLA-web corpora)
  • Exercises:
    • Basic search: Concordances, Text types
    • Advanced search: Filter, Collocations, Frequency

2. Coffee break (30 minutes)

3. Part II (90 minutes)

  • Exercises: CQL queries 

4. Discussion and closing (30 minutes)
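
For readers who have not yet met CQL (Corpus Query Language), the topic of Part II, the following minimal sketch shows how such queries are typically composed as sequences of token patterns. It is an illustration only and not part of the workshop materials: the helper functions `token` and `sequence` are hypothetical, and the attribute names `lemma` and `tag`, as well as the assumption that adjective tags begin with `A`, depend on the annotation of the corpus being queried.

```python
def token(**attrs):
    """Build one CQL token pattern from attribute="value" conditions.

    Values are treated as regular expressions by the concordancer.
    This sketch does not escape quotation marks inside values.
    """
    if not attrs:
        return "[]"  # an empty pattern matches any single token
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens):
    """Join token patterns into a query matching them in consecutive order."""
    return " ".join(tokens)


# All word forms of one lemma (attribute name `lemma` is an assumption).
all_forms = token(lemma="kuća")

# An adjective directly followed by that lemma
# (assumes adjective tags start with "A" in the corpus tagset).
adjective_noun = sequence(token(tag="A.*"), token(lemma="kuća"))

print(all_forms)       # [lemma="kuća"]
print(adjective_noun)  # [tag="A.*"] [lemma="kuća"]
```

The resulting strings can be pasted into the CQL field of a concordancer's advanced search; the attributes and tags actually available should be checked in the documentation of the chosen corpus.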

The CLASSLA-Express team

The team is made up of two linguists and two NLP scholars from two CLARIN member countries: Croatia (HR-CLARIN) and Slovenia (CLARIN.SI).

Dr Ivana Filipović Petrović is a Senior Research Associate in the Linguistic Research Institute of the Croatian Academy of Sciences and Arts. She holds a PhD in Croatian Linguistics from the University of Zagreb. Her research focuses on the interface between phraseology and lexicography. She is currently involved in COST Action CA21167 Universality, diversity and idiosyncrasy in language technology (2022-2026). From 2018 to 2023, she was coordinator of the Dictionary of Croatian Idioms project. The resulting online dictionary, co-authored with Jelena Parizoska, was compiled using data from the Croatian web corpora hrWaC and CLASSLA-web.hr. Ivana Filipović Petrović and Jelena Parizoska have co-authored four papers on using corpus data in lexicography. In 2019 they organized and convened three workshops on the use of the hrWaC corpus in studying idiomatic expressions and compiling idiom dictionaries.

Dr Jelena Parizoska is an Assistant Professor at the Faculty of Teacher Education, University of Zagreb. She holds a PhD in Linguistics from the University of Zagreb. Her research focuses on idiomatic expressions in English and Croatian within the theoretical framework of Cognitive Linguistics. She has taught university courses in figurative language, English syntax, lexicology and lexicography, and English for Specific Purposes. She is currently involved in two COST Actions: CA21167 Universality, diversity and idiosyncrasy in language technology (2022-2026) and CA22115 A Multilingual Repository of Phraseme Constructions in Central and Eastern European Languages (2023-2027).

Dr Nikola Ljubešić and Taja Kuzman are researchers from the Department of Knowledge Technologies at the Jožef Stefan Institute, Slovenia. Their research interests cover a broad spectrum of natural language processing (NLP) tasks including web corpus construction (cf. MaCoCu project), development of tools for automatic linguistic annotation for South Slavic languages (cf. CLASSLA-Stanza pipeline), specialization of language models for under-resourced languages (cf. BERTić and XLM-R-BERTić models), development of speech corpora and automatic speech recognition models (cf. ParlaSpeech corpora and MEZZANINE project), benchmarking South Slavic NLP technologies, application of NLP methods to South Slavic dialects (cf. VarDial DIALECT-COPA shared task), and machine learning tasks, including hate speech detection (cf. IMSyPP project), topic detection, and automatic genre identification. They are the leaders of the CLASSLA centre (CLARIN Knowledge Centre for South Slavic languages), which offers expertise on language resources and technologies for South Slavic languages. Nikola and Taja are the main developers behind the CLASSLA-web corpora, which are showcased in the CLASSLA-Express workshops.

About CLASSLA

The CLARIN Knowledge Centre for South Slavic languages (CLASSLA) offers expertise on language resources and technologies for South Slavic languages. Its basic activities are (1) giving researchers, students, citizen scientists and other interested parties information on the available resources and technologies via its documentation, (2) supporting them in producing, modifying or publishing resources and technologies via its helpdesk and (3) organizing training activities. Read more about CLASSLA’s activities and its mission in a Tour de CLARIN article, published here.

The helpdesk of CLASSLA can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation (detailed below) and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.

The Knowledge Centre currently offers frequently asked questions (FAQ) documentation for the Slovene, Croatian, Serbian, Bulgarian and Macedonian languages. It also offers documentation on how to use CLARIN.SI web services, which currently cover Slovene, Croatian and Serbian.

The most relevant announcements, discussed in our mailing list, are made available here. You can subscribe to the mailing list here to be informed of new resources, technologies, events and projects for South Slavic languages.

CLASSLA is operated by CLARIN.SI, the Institute of Croatian Language, and CLADA-BG.