FAQ for Serbian language resources and technologies

Slovenian Research Infrastructure for Language Resources and Technologies (Common Language Resources and Technology Infrastructure, Slovenia)

This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Serbian”.

The questions in this FAQ are organised into three main sections:

1. Online Serbian language resources

Q1.1: Where can I find Serbian dictionaries?

Q1.2: How can I analyse Serbian corpora online?

Q1.3: Which Serbian corpora can I analyse online?

Q1.4: What linguistic annotation schemas are used in Serbian corpora?

Q1.5: Where can I download Serbian resources?

2. Tools to annotate Serbian texts

Q2.1: How can I perform basic linguistic processing of my Serbian texts?

Q2.2: How can I standardize my texts prior to further processing?

Q2.3: How can I annotate my texts for named entities?

Q2.4: How can I syntactically parse my texts?

3. Datasets to train Serbian annotation tools

Q3.1: Where can I get word embeddings or pre-trained language models for Serbian?

Q3.2: What data is available for training a text normaliser for Serbian?

Q3.3: What data is available for training a part-of-speech tagger for Serbian?

Q3.4: What data is available for training a lemmatiser for Serbian?

Q3.5: What data is available for training a named entity recogniser for Serbian?

Q3.6: What data is available for training a syntactic parser for Serbian?

1. Online Serbian language resources

Q1.1: Where can I find Serbian dictionaries?

Below we list the main lexical resources:

The Srpski jezik dictionary portal includes general, orthographic and terminological dictionaries, including SASA, the official dictionary of the Serbo-Croatian language published by the Serbian Academy of Sciences and Arts. They can be accessed upon registration.
štaZnači.com is a freely accessible online dictionary.
Raskovnik is a dictionary portal that provides access to historical dictionaries of the Serbian language. Registered users can access five dictionaries with around 84 thousand entries altogether.
Digital Archive of the Serbian Academy of Sciences and Arts (DAIS) provides open access to some digitised specialised dictionaries.
Kontekst.io is a lexicon of semantically related words, automatically produced from word embeddings trained on large corpora.
Vikirečnik, the Serbian Wiktionary, is a multilingual, openly accessible, and openly editable dictionary.
The Morphological electronic dictionary of Serbian (Ekavian pronunciation) and the Serbian WordNet SrpWN can be requested via the MetaShare repository.
srLex is the largest inflectional lexicon of the Serbian language, consisting of 192,590 lexemes and 6,908,043 entries; it is searchable through the CLARIN.SI web interface (Anonymous login, Lexicon).

Q1.2: How can I analyse Serbian corpora online?

CLARIN.SI offers access to three concordancers, which share the same set of corpora and back-end, but have different front-ends:

CLARIN.SI Crystal noSketch Engine, an open-source variant of the well-known Sketch Engine. Instructions for its use are available here. CLARIN.SI offers two installations of Crystal noSketch Engine: an open installation (no log-in, which simplifies use for less advanced users) and a version with log-in which allows subcorpus creation and personalised display of e.g. corpus attributes.
KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.
CLARIN.SI Bonito noSketch Engine is the old version of noSketch Engine, with a radically different user interface from Crystal. This version offers some functions that the new noSketch Engine does not, in particular access to query results in XML: it is enough to add the parameter “format=XML” to the end of the query URL.
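As a minimal sketch of the XML export mentioned for Bonito, the snippet below appends the “format=XML” parameter to a query URL. The example URL is a placeholder and not a real CLARIN.SI address, and the helper name `with_xml_format` is our own.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_xml_format(query_url: str) -> str:
    """Return query_url with the parameter format=XML added (or replaced)."""
    parts = urlsplit(query_url)
    # Keep every existing parameter except a previous "format" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    params.append(("format", "XML"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL, for illustration only (assumption, not a real endpoint).
example = "https://example.org/bonito/run.cgi/first?corpname=demo&iquery=jezik"
print(with_xml_format(example))
```

In practice you would copy the URL of a query from the Bonito address bar and pass it to the helper, then fetch the resulting URL with any HTTP client.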

Documentation on how to query corpora via the SketchEngine-like interfaces is available here.

Note that the commercial Sketch Engine also offers access to several Serbian language corpora, as well as some additional tools that are not available in the free noSketch Engine, including tools to analyse collocations (Word sketches) and synonyms and antonyms (Thesaurus), and tools to compute frequency lists of multiword expressions (N-grams). It also allows users to create their own corpora.

Q1.3: Which Serbian corpora can I analyse online?

For a complete list of corpora available under CLARIN.SI concordancers, see the index for Crystal noSkE, Bonito noSkE or KonText. Below we list some of the important ones, with links to the Crystal noSketch Engine concordancer:

general language corpora available under the CLARIN.SI concordancers are the web corpora CLASSLA-web.sr (2.9 billion tokens), PDRS (715 million tokens) and srWaC (554 million tokens)
specialized corpora include the parliamentary corpora ParlaMint-RS and yu1Parl, the parliamentary spoken corpus ParlaSpeech-RS, the Wikipedia corpora for Serbian (CLASSLAWiki-sr) and Serbo-Croatian (CLASSLAWiki-sh), the corpus of tweets Tweet-sr, the Corpus of Serbian forms of address, and the Corpus of Torlak dialect transcriptions
manually annotated corpora include the training corpus of standard language SETimes.SR and the training corpus of computer-mediated communication ReLDI-sr with manually normalised (standardised), morphosyntactically tagged and lemmatised words and named entities
parallel corpora include the multilingual European parliamentary corpora ParlaMint-XX, paired with the machine-translated English corpora ParlaMint-XX-en

In addition to these, the general Corpus of Contemporary Serbian SrpKor2013 (122 million tokens) of the Faculty of Mathematics, University of Belgrade, which consists of literary, scientific, administrative, and general texts, is available for search through the IMS corpus workbench, but it requires authentication. Another specialised corpus, the Serbian Corpus of Early Child Language (SCECL), can be downloaded and browsed via TalkBank.

Furthermore, the commercial Sketch Engine includes some Serbian corpora in Latin and Cyrillic script.

Q1.4: What linguistic annotation schemas are used in Serbian corpora?

Most of these corpora are annotated according to the MULTEXT-East morphosyntactic specifications. The more recent ones use the Version 6 specifications for the Serbo-Croatian macrolanguage. More recent corpora also use the Universal Dependencies project annotation scheme, in particular that for Croatian and Serbian. Named entities are annotated via the Janes NE guidelines.
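MULTEXT-East morphosyntactic descriptions (MSDs) are positional tags: the first character gives the category and each following character encodes one attribute. The sketch below decodes noun tags only. The attribute order and value tables are our reading of the specifications for nouns and should be checked against the official documentation; the function name `decode_noun_msd` is our own.

```python
# Attribute order and values for nouns; an assumption to be verified
# against the MULTEXT-East specifications before serious use.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    }),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(msd: str) -> dict:
    """Turn a noun MSD such as 'Ncfsa' into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun tags")
    features = {"Category": "Noun"}
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code != "-":  # a hyphen marks an attribute that does not apply
            # Unknown codes are passed through unchanged rather than guessed.
            features[name] = values.get(code, code)
    return features


print(decode_noun_msd("Ncfsa"))
```

A full decoder would need one such table per category (verbs, adjectives, pronouns and so on), which is why the specifications remain the authoritative source.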

Q1.5: Where can I download Serbian resources?

The main point for archiving and downloading Serbian language resources is the repository of CLARIN.SI.

In addition to the resources mentioned above and below, the repository offers:

manually annotated corpora and datasets, including the sentiment-annotated Twitter corpora, the multilingual sentiment dataset of parliamentary debates ParlaSent, the post-edited and error annotated machine translation corpus PErr, and the choice of plausible alternatives datasets COPA-SR in Serbian and DIALECT-COPA in the Torlak dialect from southeastern Serbia;
parallel corpora, including the Serbian-English parallel corpora MaCoCu-sr-en and srenWaC;
wordlists and other lexical resources, including the automatically constructed multiword lexicon srMWELex, and the databases of the Western South Slavic verb HyperVerb and WeSoSlaV;
other corpora and datasets, including the largest Serbian corpus, the web corpus MaCoCu-sr with 2.5 billion words (also available as a genre-enriched version inside the MaCoCu-Genre corpus collection), the linguistically annotated corpus of parliamentary debates ParlaMint.ana, the text collection for training the BERTić transformer model BERTić-data, the corpus of legislation texts of the Republic of Serbia, the automatic speech recognition training dataset JuzneVesti-SR, the multilingual news sentiment analysis dataset SADEmma, and the news dataset SETimes.HBS and the Twitter dataset Twitter-HBS for discriminating between Bosnian, Croatian, Montenegrin and Serbian.

Another point where you can find Serbian resources is the European Language Grid platform. It includes the public mining corpus RudKorP, the lexicon of sentiment-scored words SentiWords.SR, and the morphological dictionaries SrpMD.

Additional Serbian resources are available at the Hugging Face profile of the Serbian Language Resources and Technologies Society (JERTEH), including the corpus of old Serbian novels SrpELTeC, the news corpus SrpKorNews, the parallel Serbian-English corpus of doctoral dissertation abstracts PaSaž, and the named entity recognition training corpus SrpELTeC-gold-NER. Another dataset, available on the Hugging Face repository, is the Serbian scientific corpus STARS.

Furthermore, a list of some additional resources has been compiled by the Regional Linguistic Data Initiative ReLDI. It includes the paraphrase corpus paraphrase.sr, the movie review dataset SerbMR, constructed for the task of sentiment analysis, and the sentiment analysis dataset of comments SentiComments.SR.

2. Tools to annotate Serbian texts

Q2.1: How can I perform basic linguistic processing of my Serbian texts?

The state-of-the-art CLASSLA-Stanza pipeline processes standard and non-standard (Internet) Serbian on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Serbian, the CLASSLA-Stanza pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models are also available for lemmatisation and part-of-speech tagging of both standard and non-standard Serbian. You can try out the pipeline at the CLASSLA Annotation tool website.

The documentation for the installation and use of the pipeline is available here.
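As a rough sketch of local use, the snippet below wraps the pipeline in a helper and reads the token-level columns from its CoNLL-U output. It assumes the `classla` Python package is installed and that the Serbian models are selected with the language code "sr" and the optional `type="nonstandard"` argument; please check the official documentation for the options supported by your version. The helper names and the sample annotation are ours and are hand-written for illustration, not real pipeline output.

```python
def annotate(text: str, nonstandard: bool = False) -> str:
    """Run CLASSLA-Stanza on Serbian text and return CoNLL-U output."""
    import classla  # third-party package (assumed installed: pip install classla)

    options = {"type": "nonstandard"} if nonstandard else {}
    classla.download("sr", **options)  # fetches the models on first use
    nlp = classla.Pipeline("sr", **options)
    return nlp(text).to_conll()


def token_rows(conllu: str) -> list:
    """Extract (form, lemma, UPOS, XPOS) tuples from CoNLL-U text."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if cols[0].isdigit():
            rows.append((cols[1], cols[2], cols[3], cols[4]))
    return rows


# Hand-written sample in CoNLL-U layout (illustrative, not pipeline output).
SAMPLE = "\n".join([
    "# text = Marko čita knjigu.",
    "1\tMarko\tMarko\tPROPN\tNpmsn\t_\t2\tnsubj\t_\t_",
    "2\tčita\tčitati\tVERB\tVmr3s\t_\t0\troot\t_\t_",
    "3\tknjigu\tknjiga\tNOUN\tNcfsa\t_\t2\tobj\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_",
])

for row in token_rows(SAMPLE):
    print(row)
```

With the package and models available, `token_rows(annotate("Marko čita knjigu."))` would run the same extraction on real output.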

In addition to this, tokenisation, part-of-speech tagging, and lemmatisation are also provided by the CLARIN.SI service ReLDIanno. This is a legacy system for linguistic annotation that we still keep available for backward compatibility, but we recommend that new users use the above-mentioned CLASSLA-Stanza pipeline.

Q2.2: How can I standardize my texts prior to further processing?

The CLASSLA-Stanza pipeline, mentioned above, also includes models for processing non-standard text, which allows non-standard texts to be annotated without prior standardization.

Currently, the only on-line text normalisation tool available through the CLARIN.SI services (ReLDIanno) is the REDI diacritic restorer. Its usage is documented here. You can also download it, install it and use it locally.

For word-level normalisation of user-generated Serbian texts, you can download and install the CSMTiser text normaliser.

Q2.3: How can I annotate my texts for named entities?

Named entity recognition is provided by the CLASSLA-Stanza pipeline, which also offers off-the-shelf models for standard and non-standard Serbian.

In addition to this, on-line NER is available via the CLARIN.SI service ReLDIanno. You can also download the janes-ner tool.
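Once a text has been annotated, the entity mentions usually need to be collected from the token-level labels. The sketch below assumes CoNLL-U output in which the last (MISC) column carries labels of the form `NER=B-…`, `NER=I-…` or `NER=O`; this layout, the label names and the sample sentence are assumptions made for illustration.

```python
def entity_spans(conllu: str) -> list:
    """Collect (entity_type, text) pairs from NER=B-X / NER=I-X labels."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_words
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for line in conllu.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            flush()  # blank or comment lines end any open entity
            continue
        misc = dict(item.split("=", 1) for item in cols[9].split("|") if "=" in item)
        tag = misc.get("NER", "O")
        if tag.startswith("B-"):
            flush()
            current_type, current_words = tag[2:], [cols[1]]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(cols[1])
        else:
            flush()  # "O" or an inconsistent I- label closes the entity
    flush()
    return spans


def make_line(index: int, form: str, ner: str) -> str:
    """Build a 10-column CoNLL-U line with only ID, FORM and MISC filled in."""
    return "\t".join([str(index), form] + ["_"] * 7 + ["NER=" + ner])


# Hand-written example sentence with assumed labels.
SAMPLE = "\n".join([
    make_line(1, "Ana", "B-PER"),
    make_line(2, "Ivanović", "I-PER"),
    make_line(3, "živi", "O"),
    make_line(4, "u", "O"),
    make_line(5, "Beogradu", "B-LOC"),
    make_line(6, ".", "O"),
])

print(entity_spans(SAMPLE))
```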

Q2.4: How can I syntactically parse my texts?

You can syntactically parse Serbian texts, following the Universal Dependencies formalism, in multiple ways:

by using the state-of-the-art CLASSLA-Stanza pipeline, which also offers an off-the-shelf model
by using the CLARIN.SI service ReLDIanno
by using the UDPipe tool, which has off-the-shelf models for many languages, Serbian included
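Tools that follow the Universal Dependencies formalism typically write their output in the CoNLL-U format, where column 7 holds the index of the head and column 8 the dependency relation. The sketch below turns one parsed sentence into (dependent, relation, head) triples; the sample sentence and its analysis are hand-written for illustration and are not output of any of the tools listed above.

```python
def dependency_edges(conllu: str) -> list:
    """Return (dependent, relation, head) triples for ONE CoNLL-U sentence."""
    tokens = {}
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Keep only regular token lines (skip comments and blank lines).
        if len(cols) == 10 and cols[0].isdigit():
            tokens[int(cols[0])] = (cols[1], int(cols[6]), cols[7])

    edges = []
    for index in sorted(tokens):
        form, head, relation = tokens[index]
        head_form = "ROOT" if head == 0 else tokens[head][0]
        edges.append((form, relation, head_form))
    return edges


# Hand-written sample analysis (illustrative only).
SAMPLE = "\n".join([
    "# text = Marko čita knjigu.",
    "1\tMarko\tMarko\tPROPN\tNpmsn\t_\t2\tnsubj\t_\t_",
    "2\tčita\tčitati\tVERB\tVmr3s\t_\t0\troot\t_\t_",
    "3\tknjigu\tknjiga\tNOUN\tNcfsa\t_\t2\tobj\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\tZ\t_\t2\tpunct\t_\t_",
])

for dependent, relation, head in dependency_edges(SAMPLE):
    print(f"{dependent} --{relation}--> {head}")
```

Because token indices restart in every sentence, a multi-sentence file should be split on blank lines first and each sentence passed to the function separately.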

3. Datasets to train Serbian annotation tools

Q3.1: Where can I get word embeddings or pre-trained language models for Serbian?

Embeddings trained on the srWaC and MaCoCu-sr web corpora are available as the CLARIN.SI-embed.sr embedding collection.
There are also collections of trained embeddings for Serbian available from fastText (Latin and Cyrillic script are intertwined and no transliteration was performed).
If you want to train your own embeddings, the largest freely available collection of Serbian texts is the BERTić-data text collection.
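Static embeddings are commonly distributed in the textual word2vec format: a header line with the vocabulary size and the dimension, followed by one word and its vector per line. Assuming the file you downloaded uses this format (please check its documentation), the sketch below loads the vectors and compares two words with cosine similarity. The three toy vectors are invented for the example.

```python
import io
import math


def load_vectors(handle, limit=None) -> dict:
    """Read word vectors from an open file in textual word2vec format."""
    header = handle.readline().split()
    dimension = int(header[1])  # header line: "<vocabulary size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dimension + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # large files: keep only the first `limit` words
    return vectors


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data standing in for a real embedding file (invented values).
toy_file = io.StringIO("3 2\nkuća 1.0 0.0\ndom 0.8 0.6\nreka 0.0 1.0\n")
vectors = load_vectors(toy_file)
print(cosine(vectors["kuća"], vectors["dom"]))
print(cosine(vectors["kuća"], vectors["reka"]))
```

For a real file you would replace the `io.StringIO` object with `open(path, encoding="utf-8")`, where `path` points to the downloaded file.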

You can also use the transformer language model BERTić, a state-of-the-art model that represents words/tokens as contextually dependent word embeddings. It allows you to extract word embeddings for every word occurrence, which can then be used in training a model for an end task.
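A minimal sketch of extracting one contextual vector per word is given below. It assumes that the `transformers` and `torch` packages are installed and that the model is published on the Hugging Face hub under the identifier `classla/bcms-bertic`; both the identifier and the choice of averaging subword vectors are our assumptions, not a recommendation from the model authors.

```python
from collections import defaultdict

MODEL_ID = "classla/bcms-bertic"  # assumed Hugging Face identifier of BERTić


def pool_subwords(vectors, word_ids) -> list:
    """Average the subword vectors that belong to the same word."""
    groups = defaultdict(list)
    for vector, word_id in zip(vectors, word_ids):
        if word_id is not None:  # None marks special tokens such as [CLS]
            groups[word_id].append(vector)
    return [
        [sum(column) / len(group) for column in zip(*group)]
        for _, group in sorted(groups.items())
    ]


def embed_words(words) -> list:
    """Return one contextual vector per input word (downloads the model)."""
    import torch  # third-party, assumed installed
    from transformers import AutoModel, AutoTokenizer  # third-party

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    encoded = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]
    # word_ids() maps every subword position back to its source word.
    return pool_subwords(hidden.tolist(), encoded.word_ids(0))


# The pooling step can be checked without downloading anything:
print(pool_subwords([[9.0, 9.0], [2.0, 4.0], [4.0, 6.0], [0.0, 1.0]],
                    [None, 0, 0, 1]))
```

Calling `embed_words(["Marko", "čita", "knjigu", "."])` would return four vectors, one per word, which can then serve as features for a downstream model.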

Q3.2: What data is available for training a text normaliser for Serbian?

For training text normalisers for Internet Serbian, you can use the ReLDI-NormTagNER-sr dataset, a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian.

Q3.3: What data is available for training a part-of-speech tagger for Serbian?

The reference dataset for training a standard tagger is SETimes.SR. There is also the ReLDI-NormTagNER-sr training dataset of Internet Serbian.

You can also use the CLASSLA-Stanza pipeline in combination with the CLARIN.SI embeddings and the training dataset SETimes.SR to train and evaluate your own part-of-speech tagger. The documentation is available here.

Q3.4: What data is available for training a lemmatiser for Serbian?

Lemmatisers can be trained on the tagger training data (SETimes.SR, ReLDI-NormTagNER-sr, see the section on PoS tagger training for details), on the inflectional lexicon srLex, or on both.

For training your own lemmatiser for standard and non-standard Serbian, you can use the CLASSLA-Stanza pipeline, which uses srLex as an external lexicon for lemmatisation. The documentation is available here.

Q3.5: What data is available for training a named entity recogniser for Serbian?

For training a named entity recogniser of standard language, SETimes.SR is the best resource. For training NER systems for online, non-standard texts, ReLDI-NormTagNER-sr can be used.

The CLASSLA-Stanza pipeline allows you to train your own named entity recogniser as well. The documentation is available here.

Q3.6: What data is available for training a syntactic parser for Serbian?

If you want to follow the Universal Dependencies formalism for dependency parsing, the best location for obtaining training data is the Universal Dependencies repository.

If you require additional annotation layers, e.g., for multi-task learning, the SETimes.SR dataset should be used.

You can also use the CLASSLA-Stanza pipeline to train your own parser. The documentation is available here.
