This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at firstname.lastname@example.org, with the subject “FAQ_Serbian”.
The questions in this FAQ are organised into three main sections:
- Online Serbian language resources
- Tools to annotate Serbian texts
- Datasets to train Serbian annotation tools
- Where can I get word embeddings for Serbian?
- What data is available for training a text normaliser for Serbian?
- What data is available for training a part-of-speech tagger for Serbian?
- What data is available for training a lemmatiser for Serbian?
- What data is available for training a named entity recogniser for Serbian?
- What data is available for training a syntactic parser for Serbian?
Online Serbian language resources
Q1.1 Where can I find Serbian dictionaries?
Below we list the main lexical resources:
- Raskovnik is a dictionary portal that provides access to historical dictionaries of the Serbian language; it currently consists of five dictionaries with around 84 thousand entries altogether
- Lexicom provides a search interface to an inflectional lexicon of the Serbian language
- srLex is the largest inflectional lexicon of the Serbian language, consisting of 192,590 lexemes and 6,908,043 entries; it is searchable through the CLARIN.SI web interface (Anonymous login, Lexicon)
Q1.2 How can I analyse Serbian corpora online?
CLARIN.SI offers access to two concordancers, which share the same set of (Serbian) corpora and back-end, but have different front-ends:
- NoSketch Engine, an open-source variant of the well-known Sketch Engine. No registration is necessary or possible, which also has drawbacks, e.g. you cannot save your screen settings or make private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionality, it is necessary to log in via AAI through your identity provider.
Documentation on how to query corpora via the SketchEngine-like interfaces is available here.
Note that the commercial Sketch Engine also offers access to several Serbian language corpora. Furthermore, for researchers in the EU, access to Sketch Engine is free for non-commercial purposes in 2018-2022.
Q1.3 Which Serbian corpora can I analyse online?
These are the main general language corpora:
- The largest corpus of the Serbian language is the Serbian web corpus srWaC (554 million words), which you can query via NoSkE or KonText.
- The Corpus of Contemporary Serbian (SrpKor2013) of the Faculty of Mathematics, University of Belgrade, contains 122 million tokens and consists of literary, scientific, administrative and general texts. It can be searched through the IMS corpus workbench, but this requires authentication (details can be found here).
The main specialised corpora are the following:
- The only available specialised corpus of the Serbian language we are aware of is the Serbian Corpus of Early Child Language (SCECL), which can be downloaded and browsed via TalkBank.
Finally, the main manually annotated corpora are the following:
- The training corpus of standard language (SETimes.SR) is available through NoSkE and KonText.
- The training corpus of computer-mediated communication (ReLDI-NormTagNER-sr) is available through NoSkE and KonText.
Q1.4 What linguistic annotation schemas are used in Serbian corpora?
Most of these corpora are annotated on the level of morphosyntax with the MULTEXT-East tagset, either for Croatian (srWaC, SETimes.SR, ReLDI-NormTagNER-sr) or Serbian (SrpKor2013). On the level of syntax, most corpora are annotated with the schema from the Universal Dependencies project. The Universal Dependencies project also contains a tagset for annotating morphosyntax, which is currently applied in the SETimes.SR training corpus. Named entities are annotated following the guidelines developed for South Slavic languages.
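MULTEXT-East morphosyntactic tags (MSDs) are positional: the first character gives the category and each following character encodes one attribute. The sketch below shows how such a tag could be unpacked for nouns. The attribute table is a simplified subset written for illustration, not a copy of the official specifications, and the function name `decode_noun_msd` is hypothetical.

```python
# Positional attributes for noun MSDs. This table is a simplified,
# assumed subset for illustration; consult the MULTEXT-East
# specifications for the authoritative definitions.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
        "i": "instrumental",
    }),
]


def decode_noun_msd(msd):
    """Turn a noun MSD such as 'Ncfsn' into a feature dictionary."""
    if not msd or msd[0] != "N":
        raise ValueError("not a noun MSD: %r" % msd)
    features = {"Category": "Noun"}
    # Pair each character after the category with its position;
    # positions beyond the table above are ignored in this sketch.
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # '-' marks an attribute that does not apply
            continue
        features[name] = values.get(code, "unknown (%s)" % code)
    return features


if __name__ == "__main__":
    print(decode_noun_msd("Ncfsn"))
```

Other categories (verbs, adjectives, pronouns and so on) follow the same positional principle with their own attribute tables.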
Q1.5 Where can I download Serbian resources?
The main point for archiving and downloading Serbian language resources is the repository of CLARIN.SI.
Another point for downloading resources in Serbian is the MetaShare repository.
Tools to annotate Serbian texts
Q2.1 How can I perform basic linguistic processing of my Serbian texts?
- Tokenisation, part-of-speech tagging and lemmatisation of your texts can be done via the CLARIN.SI services. The documentation for using the services, either via a web interface or as a web service, is available here. You can also install the same tools locally, namely the tokeniser and the part-of-speech tagger and lemmatiser.
Q2.2 How can I standardize my texts prior to further processing?
- Currently, the only on-line text normalisation tool available through the CLARIN.SI services is the REDI diacritic restorer. The usage of the CLARIN.SI services is documented here. You can also download the REDI diacritic restorer, install it and use it locally.
- For word-level normalisation of user-generated Serbian texts, you can download and install the CSMTiser text normaliser.
Q2.3 How can I annotate my texts for named entities?
- On-line NER is available via the CLARIN.SI services documented here. You can also download this NER tool and use it locally.
Q2.4 How can I syntactically parse my texts?
You can syntactically parse Serbian texts in multiple ways:
- by using the CLARIN.SI services (Universal Dependencies formalism)
- by using the UDPipe tool, which has off-the-shelf models for many languages, Serbian included (Universal Dependencies formalism)
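Tools that follow the Universal Dependencies formalism commonly write their output in the CoNLL-U format, with one token per line and ten tab-separated columns. The following is a minimal sketch of how such output could be read and turned into dependency edges. The sample sentence is hand-written for illustration and is not actual output of either tool, and the function names are hypothetical.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comments
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 columns: %r" % line)
        token = dict(zip(CONLLU_FIELDS, columns))
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if token["id"].isdigit():
            current.append(token)
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence):
    """Return (head form, relation, dependent form) triples."""
    forms = {"0": "ROOT"}
    forms.update((tok["id"], tok["form"]) for tok in sentence)
    return [(forms.get(tok["head"], "?"), tok["deprel"], tok["form"])
            for tok in sentence]


# Hand-written example, not real parser output.
SAMPLE = (
    "# text = Marija čita knjigu.\n"
    "1\tMarija\tMarija\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tčita\tčitati\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tknjigu\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for head, relation, dependent in dependency_edges(sentence):
            print(head, relation, dependent)
```

For anything beyond quick inspection, an established CoNLL-U library is likely a better choice than a hand-rolled reader like this one.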
Datasets to train Serbian annotation tools
Q3.1 Where can I get word embeddings for Serbian?
Embeddings trained on the srWaC web corpus are available as the CLARIN.SI-embed.sr embedding collection.
There are also collections of trained embeddings for Serbian available from fastText (Latin and Cyrillic script are intertwined and no transliteration was performed).
If you want to train your own embeddings, the largest freely available collection of Serbian texts is the srWaC corpus.
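Because some collections mix Latin and Cyrillic script, it can help to bring your own texts into a single script before looking up or training embeddings. Serbian Cyrillic maps to Latin letter by letter, so a small table is enough. The sketch below is one possible way to do this; the function name `cyrillic_to_latin` and the casing rule for digraphs are choices made for this example.

```python
# Serbian Cyrillic to Latin, lowercase letters only; case is handled below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, leaving other characters as-is."""
    out = []
    for i, ch in enumerate(text):
        lower = ch.lower()
        latin = CYR_TO_LAT.get(lower)
        if latin is None:        # not a Serbian Cyrillic letter
            out.append(ch)
        elif ch == lower:        # lowercase input
            out.append(latin)
        else:
            # Uppercase input: write digraphs as 'LJ' inside all-caps
            # words and as 'Lj' otherwise (an assumption of this sketch).
            following = text[i + 1] if i + 1 < len(text) else ""
            out.append(latin.upper() if following.isupper()
                       else latin.capitalize())
    return "".join(out)


if __name__ == "__main__":
    print(cyrillic_to_latin("Њујорк и Београд"))
```

The reverse direction is harder: a Latin sequence such as "nj" is usually one letter but sometimes two (as in "injekcija"), so Latin-to-Cyrillic conversion needs an exception list or a lexicon.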
Q3.2 What data is available for training a text normaliser for Serbian?
For training text normalisers for Internet Serbian, the ReLDI-NormTagNER-sr dataset can be used.
Q3.3 What data is available for training a part-of-speech tagger for Serbian?
Q3.4 What data is available for training a lemmatiser for Serbian?
Q3.5 What data is available for training a named entity recogniser for Serbian?
Q3.6 What data is available for training a syntactic parser for Serbian?
If you require additional annotation layers, e.g., for multi-task learning, the SETimes.SR dataset should be used.