Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

Events

CLASSLA K Centre Events

Events organised by CLASSLA are presented below:

September 2024: Round Table on the Usage of Large Language Models in Corpus-Linguistic Research

On September 18th, the CLASSLA Knowledge Centre hosted a round table on the use of Large Language Models (LLMs) in corpus-linguistic research. This event was part of the pre-conference program for the Language Technologies and Digital Humanities Conference (JTDH 2024). During the engaging discussion, corpus linguists, computational linguists, and other experts in language technologies shared their perspectives on the benefits and challenges of incorporating LLMs into corpus-linguistic research.

The key takeaways from the discussion are as follows:

  • LLMs are a disruptive technology with significant impact on all fields working with language, including corpus linguistics.
  • The rise of LLMs will not render corpora obsolete. There is no need to worry that LLMs will completely replace traditional approaches; rather, they will accompany them.
  • Our responsibility is to define a methodologically sound way of combining the traditional and LLM approaches, including where they fit within our workflows and how to evaluate their effectiveness.
  • Corpus-linguistic studies require real examples, traceable sources of information, controlled datasets, and reproducibility – criteria that LLMs, as black-box models, currently cannot meet.
  • On the question of machine-generated content: as long as it is read by humans and thereby affects the language, it is relevant for corpus linguistics and should therefore be included in corpora.
  • LLMs still struggle with basic counting tasks, so frequency-related analyses should remain within the realm of traditional corpus linguistics and statistical analysis and modeling.
  • LLMs can be a highly valuable tool for enriching text corpora by providing additional metadata, such as information on genre, sense, idiomaticity, or other features of the words and texts in the corpus. The most significant breakthrough LLMs may bring lies in automatic annotation based on multi-layer and semantic features – an area that was previously challenging to automate.
  • Panelists encouraged researchers to experiment with LLMs for automating text annotations and streamlining parts of their research. However, they emphasized the importance of validating the LLMs’ performance on manually annotated samples for each specific task.
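
The last takeaway, validating LLM annotations against a manually annotated sample, can be sketched in a few lines of Python. This is only an illustration, not a procedure presented at the round table: the genre labels and the two label lists are hypothetical, and raw accuracy plus Cohen's kappa are just one possible choice of agreement measures.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of items where the LLM label equals the manual label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cohen_kappa(gold, predicted):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(gold)
    observed = accuracy(gold, predicted)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that both sources pick the same label
    # if each labelled at random according to its own label distribution.
    expected = sum(gold_counts[label] * pred_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample: manual genre labels vs. labels proposed by an LLM.
manual_labels = ["news", "forum", "news", "legal", "forum", "news"]
llm_labels = ["news", "forum", "legal", "legal", "forum", "news"]

print(f"accuracy: {accuracy(manual_labels, llm_labels):.2f}")
print(f"kappa:    {cohen_kappa(manual_labels, llm_labels):.2f}")
```

In practice the sample would be drawn separately for each annotation task, since agreement on one task says little about another.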

We would like to extend our gratitude to the panelists: Špela Arhar Holdt, Ivana Filipović Petrović, Simon Krek, Taja Kuzman, Jelena Parizoska, Tanja Samardžić, and Luka Terčon. Special thanks also to Nikola Ljubešić, the moderator and event organizer, as well as to the audience for their valuable contributions to the discussion.

 

April to November 2024: CLASSLA-Express – Workshops on using CLARIN.SI corpora in language research

From April to November 2024, the CLASSLA-Express workshop series took place in five countries: Croatia (Zagreb and Rijeka), Serbia (Belgrade and Kragujevac), North Macedonia (Skopje), Bulgaria (Sofia) and Slovenia (Ljubljana). The workshops showed participants how to use the CLASSLA web corpora in language research. They comprised hands-on exercises showing how to create queries in corpora for Bulgarian, Croatian, Macedonian, Serbian and Slovene. More details on the CLASSLA-Express workshops are available here.

Report from the first two stops of CLASSLA-Express: Zagreb and Rijeka

The CLASSLA-Express is in full swing, with two workshops already completed. The first was held on April 19 at the Faculty of Humanities and Social Sciences in Zagreb, with 22 participants. The second was held at the Faculty of Humanities and Social Sciences in Rijeka, with 16 participants in attendance.

Each workshop was divided into two parts: the first one focused on theory and the second on practical applications. Both workshop convenors gave introductory lectures during the theoretical part. Ivana Filipović Petrović introduced the Slovenian national consortium CLARIN.SI and presented the CLASSLA corpora developed at the CLASSLA Knowledge Centre. Jelena Parizoska provided an overview of the application of computer corpora in linguistic research.

The practical part of the workshops was dedicated to introducing participants to basic and advanced searches in the NoSketch Engine tool, which hosts the freely accessible CLASSLA-web corpora. Participants actively engaged in the exercises, posing questions related to their specific linguistic interests. They were particularly interested in formulating CQL queries with the convenors on-site and doing exercises where they had to find results themselves. Many topical questions, such as how to use the power of large language models in linguistic research, were raised during the discussion. The overall conclusion was that we still need corpora for trustworthy examples and distributions of human language use, but that large language models can be used to enrich or filter the corpus evidence.

Continuing onward, the CLASSLA-Express team is very excited about the upcoming stops.

Report from the CLASSLA-Express stop in Belgrade

The CLASSLA-Express continues its route, with its third workshop successfully completed on May 29 in Belgrade, Serbia. This event was part of the Leksikografski susreti conference held at the Faculty of Philology, University of Belgrade. The workshop, convened by Ivana Filipović Petrović, attracted 21 participants, reflecting a strong interest in complex CQL exercises and corpus research in general.

As in the previous two workshops, the session was divided into theory and practical application. Ivana Filipović Petrović introduced the Slovenian national consortium CLARIN.SI and the CLASSLA Knowledge Centre during the theoretical part. The practical part focused on familiarizing participants with both basic and advanced searches using the CLARIN.SI NoSketch Engine tool, which hosts the freely accessible CLASSLA web corpora. Participants were particularly engaged in the hands-on exercises, especially those involving CQL queries, and showed a keen interest in finding results independently.

The last section of the workshop was dedicated to addressing individual queries from participants related to their corpus research, which proved to be beneficial to everyone involved. Additionally, we are happy to announce that following a proposal by one of the workshop attendees, a new CLASSLA-Express workshop is scheduled to take place in Kragujevac, Serbia in the fall.

Report from the CLASSLA-Express stop in Skopje

The fourth CLASSLA-Express workshop was held on June 4 in Skopje, at the Blaže Koneski Faculty of Philology. The event was opened by Dean Vladimir Martinovski. The workshop was convened by Jelena Parizoska, with 14 participants in attendance – faculty members of Ss. Cyril and Methodius University in Skopje and University St. Kliment Ohridski – Bitola, with a background in modern Slavic, Germanic and Romance languages, Language for Specific Purposes, as well as translation and interpreting. Given that CLASSLA-web.mk is the first general corpus of Macedonian, the workshop focused exclusively on that language.

The introductory part covered basic information about CLASSLA web corpora, after which the convenor provided details on using corpus data in research on Macedonian. This was followed by tasks which allowed participants to query the CLASSLA-web.mk corpus. The exercises included queries which can be used in research on any South Slavic language, as well as those specifically pertaining to Macedonian, such as the definite and indefinite forms of nouns.

Throughout the workshop, the participants were encouraged to make suggestions on querying the corpus based on their own research interests. One of those was figurative language, which resulted in jointly creating queries involving idiomatic expressions containing a specific word, and searching for variations of proverbs. Hands-on tasks proved to be particularly useful to participants, as they may be used in both research and teaching.

Report from the CLASSLA-Express stop in Sofia

The fifth CLASSLA-Express workshop was held on June 26 in Sofia. The focus of this stop was on the usage of CLASSLA web corpora for Bulgarian. The workshop, convened by Jelena Parizoska and Petya Osenova, attracted 25 registered participants from various backgrounds: students of Bulgarian language and Slavic Philology, university lecturers of Bulgarian as a first and foreign language and of Slavic Grammar, and scientists from various institutes of the Bulgarian Academy of Sciences, such as the Institute of Balkan Studies and the Division of Bulgaria Encyclopedia.

First, the participants were introduced to the mission of CLARIN.SI, CLaDA-BG and the CLASSLA Knowledge Centre. Then, they were shown how to use large Bulgarian corpora for targeted searches. Most of the time was devoted to hands-on sessions in which the participants could perform their own searches and ask related questions.

The lessons learned are as follows: a) such training sessions were much appreciated by all types of users, advanced ones as well as beginners; b) preparing the handouts and instructor’s copies in advance is crucial for a smooth workflow; c) scenario-oriented searches are a successful training technique.

Report from the CLASSLA-Express stop in Ljubljana

On September 18, the CLASSLA-Express train made a stop in Ljubljana, marking the sixth workshop in the series. This workshop was part of the pre-conference program of the Slovenian Language Technologies and Digital Humanities Conference (JTDH 2024).

The workshop was convened by Jelena Parizoska and Ivana Filipović Petrović, with 15 participants in attendance. The presence of the creators of CLASSLA-web corpora – Nikola Ljubešić and Taja Kuzman – proved invaluable in discussions with participants, given that a number of questions were related to linguistic annotation and metadata enrichment, such as genre categories and MULTEXT-East part-of-speech tagsets. The questions posed to the convenors during discussions pertained to participants’ specific interests, e.g., searching for variations of idiomatic expressions in corpora and applying corpus data in lexicography.

The workshop was followed by another JTDH 2024 pre-conference event – a round table on the use of Large Language Models (LLMs) in corpus linguistic research – which was convened by Nikola Ljubešić, with Ivana Filipović Petrović, Jelena Parizoska and Taja Kuzman participating as panellists.

Report from the CLASSLA-Express stop in Kragujevac

On November 20, 2024, the CLASSLA-Express made its final stop for the year in Kragujevac, Serbia. The workshop, convened by Ivana Filipović Petrović, was attended by 11 participants, predominantly foreign language teachers from the Faculty of Philology and Arts. The workshop fostered lively discussions, focusing on participants’ specific areas of interest.

As with previous workshops, the session included both a theoretical introduction and a hands-on segment. The theoretical portion provided an overview of the CLASSLA Knowledge Centre and the tools available for South Slavic languages. The practical segment focused on querying the CLASSLA-web corpora on the CLARIN.SI NoSketch Engine concordancer, with participants tackling both basic and advanced queries. 

As we wrap up this year’s successful CLASSLA-Express journey, we invite everyone interested in corpus linguistics and LLMs to stay tuned for more stops and exciting plans next year.

November 2021: Workshop on regional markedness in text

On 6 and 7 November 2021, an online workshop dedicated to regional markedness in text took place, organised by the ReLDI centre, University of Zurich, and CLASSLA.

The program of the two-day event included a keynote talk on computational dialectology by Yves Scherrer from the University of Helsinki, Darja Fišer’s presentation of the student research at the JTDH Language Technologies and Digital Humanities Conference, and two interactive workshops: Interactive workshop on regional variation in text, led by Sara Košutar, Larissa Schmidt, and Leyla Feiner, and Regional variation in gender marking: a hands-on tutorial on extracting data from corpora, led by Mirjana Starović and Tanja Samardžić.

The materials for the workshop on regional variation in gender marking are available here. They provide a gentle introduction to the process of analysing corpora, containing information on:

  • which South Slavic corpora are available on the CLARIN.SI repository, and how to find comparable corpora
  • how to explore corpora through the NoSketch Engine and KonText concordancers
  • how to query the corpora using the CQL (Corpus Query Language) syntax
  • how to analyse gender marking in each South Slavic corpus by analysing the number of occurrences of feminine and masculine nouns describing occupations (e.g. the feminine and masculine nouns for the word “director”)
  • how to use the morphosyntactic descriptions (MSDs) to analyse the distribution of verbs with feminine and masculine suffixes (e.g. “mislila” vs. “mislil” for “she/he thought”)
  • and finally, how the results can be interpreted to analyse gender bias in society.
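
The counting step described in the list above can be sketched in Python. This is not taken from the workshop materials: the lemmas "direktorica" and "direktor" are illustrative choices, the corpus names and hit counts are invented placeholders rather than real results, and `[lemma="..."]` is the basic CQL pattern for matching all forms of a lemma.

```python
def lemma_query(lemma: str) -> str:
    """Build a basic CQL query that matches every form of one lemma."""
    return f'[lemma="{lemma}"]'


def feminine_share(feminine_hits: int, masculine_hits: int) -> float:
    """Proportion of feminine forms among all hits for an occupation pair."""
    total = feminine_hits + masculine_hits
    if total == 0:
        raise ValueError("no hits for either form")
    return feminine_hits / total


# Queries that would be pasted into the concordancer (illustrative lemmas).
feminine_query = lemma_query("direktorica")
masculine_query = lemma_query("direktor")

# Placeholder hit counts per corpus: (feminine, masculine). Not real data.
hits = {
    "corpus_a": (120, 880),
    "corpus_b": (300, 700),
}

for corpus, (fem, masc) in hits.items():
    print(f"{corpus}: feminine share = {feminine_share(fem, masc):.2f}")
```

Comparing shares rather than raw counts keeps corpora of different sizes comparable for a single noun pair.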

Some excerpts of the document are presented below:


May 2020: First CLASSLA K Centre workshop

The first CLASSLA K Centre workshop was supposed to be held from May 6 to May 8, 2020 in Ljubljana, but due to the COVID-19 crisis, the face-to-face workshop had to be postponed. However, in the process of selecting participants a very nice crowd came together, so we decided to host an online Zoom session on May 6, the day we should have all met in Ljubljana.

The session took two hours. In the first hour, all the participants briefly presented themselves. In the second hour, the future steps for the workshop and the knowledge centre were briefly discussed, kick-started by the results of the survey that participants took before the online session. The ReLDI centre for linguistic data and the current CLARIN ERIC funding opportunities were also presented.

This discussion revealed the following priorities: (1) connecting web services with concordancers is a much sought-after feature, so that researchers could easily process and publish their raw textual data, (2) the knowledge centre might need a form for reporting use cases on its resources (a draft of such a form has been made available here), and (3) the participants are very interested in holding group discussions on specific topics, which will be organised in the weeks to come.

The whole online session seemed to be a very pleasant experience for the 42 participants and we have the Zoom photos of the participants below to prove that!

We are still looking forward to the face-to-face workshop, which we hope will take place next year.