LLMs4SSH: Knowledge Centre for Large Language Models in SS&H

	Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije Common Language Resources and Technology Infrastructure, Slovenia

CLARIN.SI is a member of LLMs4SSH, the CLARIN K-centre for Large Language Models for Social Sciences and Humanities. The LLMs4SSH Centre offers expertise on various applications of LLMs in processing language data and on expansion and adaptation of LLMs to the needs of researchers from Social Sciences and Humanities.

On this page, we provide key information on current activities related to large language models (LLMs) in Slovenia.

Contents

1 Key Projects Focusing on LLMs

2 Benchmarks for LLMs in Slovenian

3 Benchmarks and Datasets for LLM evaluation in South Slavic Languages

4 Large Language Models and Other Language Technologies for Slovenian

5 Contact Us and Stay Updated

6 The CLARIN.SI Team in LLMs4SSH

Key Projects Focusing on LLMs

PoVeJMo (Adaptive Natural Language Processing with Large Language Models): This national project is developing the first large language models specifically tailored to the Slovenian language. The resulting models are openly available as GaMS models. Within the project, they will serve as the foundation for advanced applications in medicine, the humanities, industrial environments, and software development.
LLM4DH (Large Language Models for Digital Humanities): This national project focuses on extensive evaluation and benchmarking of LLMs for Slovenian, their application to research in humanities fields (linguistics and lexicography, education, contemporary history, folkloristics, and law), and development of visual LLMs for Slovenian.
AI4DH (Centre of Excellence in Artificial Intelligence for Digital Humanities): This EU-funded project aims to establish the University of Ljubljana (Slovenia) as a leading institution in Europe for AI applications in digital humanities (DH). The project will set up a Centre of Excellence that combines top-tier AI research with support for DH scholars, enhancing their ability to leverage AI opportunities.
ALT-EDIC4EU (Alliance for Language Technologies for the European Union): This EU-funded project aims to facilitate the development of a robust and scalable infrastructure and operations for the Alliance for Language Technologies (ALT-EDIC), in order to support the federation of the European Language Technology ecosystem. The project involves experts, institutions and industries from strategic domains, including the Jožef Stefan Institute from Slovenia.
LLMs4EU (Large Language Models for the European Union): This EU-funded project aims to establish a one-stop shop for language data to generate value for developers of LLMs, a cutting-edge platform for the transparent evaluation and benchmarking of LLMs in European languages, and to develop language models tailored to specific languages, sectors, and use cases from diverse application domains (energy, telecom, tourism, public services, and science). The project is carried out by a broad consortium of leading research centres and companies specializing in language data management, LLMs and language technologies, with some of the core partners coming from Slovenia.

Benchmarks for LLMs in Slovenian

The following benchmarks enable evaluation of large language models in Slovenian:

SloBENCH (Slovenian NLP Benchmark): This evaluation platform enables benchmarking of Slovenian natural language processing technologies on the following tasks: natural language inference (NLI), machine translation (between English and Slovenian), speech recognition (ASR), named entity recognition (NER) and dependency parsing. It also includes the Slovenian Winograd Schema Challenge (WSC) dataset and SuperGLUE benchmarks.

Slovenian LLM eval: A set of benchmarks (ARC Challenge, ARC Easy, BoolQ, HellaSwag, NQ Open, OpenBookQA, PIQA, TriviaQA, Winogrande) for evaluating Slovenian language models. It builds upon the work of Aleksa Gordić, who translated some of the popular English benchmarks into Slovenian via machine translation. The authors have further improved the quality of these automatic Slovenian translations.

Benchmarks and Datasets for LLM evaluation in South Slavic Languages

CLARIN.SI also works intensively on various South Slavic languages via its CLASSLA Knowledge Centre for South Slavic languages. As part of that work, a series of benchmarks have been developed for numerous languages. We list the most prominent ones:

BENCHić: benchmarking platform for Croatian, Serbian, Bosnian, and Macedonian, covering named entity recognition (NER), sentiment identification, commonsense reasoning and language identification.

DIALECT-COPA: commonsense reasoning in South Slavic languages and dialects (Slovenian, Cerkno, Croatian, Chakavian, Serbian, Torlak, Macedonian)
IPTC news topic classification (Slovenian, Croatian, Greek, Catalan)
AGILE benchmark on text genre identification (Slovenian, Croatian, Macedonian, English, Albanian, Catalan, Greek, Icelandic, Maltese, Turkish, and Ukrainian)
UniversalNER benchmark for many languages, including Croatian and Serbian
ParlaSent sentiment identification dataset in parliamentary debates (Slovenian, Croatian, Bosnian, Serbian, Czech, Slovak, English)
ParlaPause benchmark on filled pause detection in speech (Slovenian, Croatian, Serbian, Czech, Polish)
Mak Na Konac automatic speech recognition benchmark for Croatian and Serbian
Mići Princ automatic speech recognition benchmark for the Chakavian dialect of Croatian

For an overview of freely available datasets, including general text collections and training and test datasets for various NLP tasks, see the Frequently Asked Questions (FAQ) for the Slovenian, Croatian, Serbian, Bulgarian and Macedonian languages, curated by the CLASSLA Knowledge Centre. The FAQ also provides information about resources and technologies for linguistic annotation of South Slavic texts.

Large Language Models and Other Language Technologies for Slovenian

The main sites where you can find open-source large language models and language technologies for Slovenian are:

The overview of openly available large language models, speech technologies, and other natural language processing (NLP) technologies for Slovenian, curated by the CLASSLA Knowledge Centre, provides information on openly available technologies for Slovenian, benchmarks and papers on:

Contact Us and Stay Updated

If you have any questions related to large language models, language technologies or language resources, the CLASSLA Knowledge Centre has a helpdesk dedicated to these topics for South Slavic languages. It can be contacted via helpdesk.classla@clarin.si. The helpdesk offers additional clarifications regarding the CLASSLA documentation and support in using, modifying, producing, or publishing resources and technologies for South Slavic languages.

You can subscribe to the CLASSLA mailing list here to be informed of new resources, technologies, events and projects for Slovenian and other South Slavic languages.

Stay updated on the latest activities of the CLASSLA Knowledge Centre and the CLARIN.SI infrastructure, which are the Slovenian members of the LLMs4SSH Knowledge Centre, by following:

CLARIN.SI on X and LinkedIn
the Discord group “Slovenska skupnost za jezikovne vire in tehnologije”

The CLARIN.SI Team in LLMs4SSH

The main CLARIN.SI members participating in the LLMs4SSH Knowledge Centre are Simon Krek, Nikola Ljubešić, Špela Vintar, and Taja Kuzman Pungeršek.

Dr Simon Krek is a researcher from the Department for Artificial Intelligence at the Jožef Stefan Institute and the head of the Centre for Language Resources and Technologies (CJVT) at the University of Ljubljana, Slovenia. His research fields are lexicography and lexicogrammar, corpus linguistics, natural language processing, language technology infrastructure, and computer-aided language learning and teaching. He has coordinated major Slovenian projects for language technologies (cf. Communication in Slovene and Development of Slovene in a Digital Environment). In addition to participating in numerous European projects (META-NET, xLike and others), he has led the H2020-funded ELEXIS project (European Lexicographic Infrastructure). He is currently leading the PoVeJMo project, which focuses on developing large language models for Slovenian, and is involved in several other major projects related to language technologies and large language models, including the MEZZANINE project for Slovenian speech technologies and the LLM4DH, ALT-EDIC4EU and LLMs4EU projects. He also serves as a deputy national coordinator of the CLARIN.SI infrastructure.

Dr Nikola Ljubešić and Taja Kuzman Pungeršek are researchers from the Department of Knowledge Technologies at the Jožef Stefan Institute, Slovenia. Their research interests cover a broad spectrum of natural language processing (NLP) tasks, including web corpus construction (cf. MaCoCu project and CLASSLA-web corpora), development of tools for automatic linguistic annotation of South Slavic languages (cf. CLASSLA-Stanza pipeline), specialization of language models for under-resourced languages (cf. BERTić and XLM-R-BERTić models), development of speech corpora and automatic speech recognition models (cf. ParlaSpeech corpora and MEZZANINE project), benchmarking of South Slavic NLP technologies, application of NLP methods to South Slavic dialects (cf. VarDial DIALECT-COPA shared task), and machine learning tasks, including hate speech detection (cf. IMSyPP project), topic detection, and automatic genre identification. They are the leaders of the CLASSLA centre (CLARIN Knowledge Centre for South Slavic languages), which offers expertise on language resources and technologies for South Slavic languages, and members of the CLARIN.SI management committee.

Dr Špela Vintar is a researcher at the Centre of Network Infrastructure at the Jožef Stefan Institute and a full professor at the Department of Translation Studies, Faculty of Arts, University of Ljubljana. Her research interests span various areas of digital linguistics and language processing. In terminology and knowledge mining, she led TermFrame, which created a multilingual frame-based knowledge base, and was a researcher in the JANES project, exploring terminology in non-standard Slovenian. In machine translation, she was involved in the Development of Slovene in a Digital Environment project. In sign language research, she led SIGNOR. More recently, her work also covers cognitive approaches to semantics and language modelling, where she heads the SWOW-SL word association collection, and the evaluation and benchmarking of LLMs within the LLM4DH project, where she explores nuanced language and bias. She is the founder and coordinator of the Joint Master in Digital Linguistics, established on the basis of the awarded KA2-Erasmus+ project DigiLing: Trans-European e-learning hub for Digital Linguistics.