Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

FAQ for Bulgarian language resources and technologies

This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Bulgarian”.

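The questions in this FAQ are organised into three main sections: online Bulgarian language resources, tools to annotate Bulgarian texts, and datasets to train Bulgarian annotation tools.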

1. Online Bulgarian language resources

Disclaimer: Please note that CLaDA-BG has been in operation since 2018. During the second phase, 2019-2020, a dedicated repository hosting language resources and tools for Bulgarian is planned to go into operation. It will then become possible to deposit these resources and tools with CLARIN ERIC.

Q1.1: Where can I find Bulgarian dictionaries?

The main dictionary portals offered by CLaDA-BG partners or supported by CLaDA-BG are the following:

Dictionaries by other providers:

The Institute for Bulgarian Language at the Bulgarian Academy of Sciences provides a number of online dictionaries:

Q1.2: How can I analyse Bulgarian corpora online?

Bulgarian corpora can be analysed online through the following platforms:

Q1.3: Which Bulgarian corpora can I analyse online?

Below we list the main corpora portals offered by CLaDA-BG partners or supported by CLaDA-BG:

Corpora by other providers:

Q1.4: What linguistic annotation schemas are used in Bulgarian corpora?

For word-level morphosyntactic annotation, most corpora use the BulTreeBank tagset, which is based on the Bulgarian MULTEXT-East tagset. The description can be found in the reports BTB-TR03 and BTB-TR04 of the BulTreeBank project.

For syntactic annotation, two schemas are used: the HPSG-based one and the one defined by the Universal Dependencies project. The Universal Dependencies project also defines a feature set for annotating morphosyntax.
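To illustrate how these annotation layers sit side by side in a Universal Dependencies file, here is a minimal Python sketch that splits one token line of the CoNLL-U format into its ten standard columns. The sample line is hand-written for illustration and is not copied from any corpus; the tag and feature values in it are assumptions, and the function name parse_token_line is hypothetical.

```python
# Column names of the CoNLL-U format used by Universal Dependencies.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 tab-separated columns, got %d" % len(values))
    token = dict(zip(CONLLU_FIELDS, values))
    # FEATS is a '|'-separated list of Name=Value pairs, or '_' when empty.
    feats = token["feats"]
    token["feats"] = {} if feats == "_" else dict(
        pair.split("=", 1) for pair in feats.split("|"))
    return token


# Hand-written illustrative line (assumed values, not corpus data).
sample = ("3\tкнига\tкнига\tNOUN\tNcfsi\t"
          "Definite=Ind|Gender=Fem|Number=Sing\t0\troot\t_\t_")
token = parse_token_line(sample)
print(token["upos"], token["xpos"], token["feats"])
```

In this layout the language-specific tag goes in the XPOS column, while the UPOS and FEATS columns carry the Universal Dependencies part of speech and morphosyntactic features.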

Q1.5: Where can I download Bulgarian resources?

Bulgarian resources can be downloaded from several places:

In addition to the resources mentioned above and below, the CLARIN.SI repository offers:


2. Tools to annotate Bulgarian texts

Q2.1: How can I perform basic linguistic processing of my Bulgarian texts?

The UDPipe tool has a module for Bulgarian, which performs tokenisation, morphosyntactic annotation, lemmatisation and dependency parsing.

The well-known TreeTagger also offers a module for tagging Bulgarian.

The best results, including further annotation levels, are currently achieved with the Bulgarian Linguistic Pipe, which is based on the CLaRK system and its related trained models. The Pipe will be made publicly available shortly. The CLaRK system itself can be downloaded and customised locally for various tasks; its site also contains a manual and demo examples. CLaRK provides a built-in tokeniser, but the morphosyntactic tagger and lemmatiser models are not available online.

To annotate your texts for these levels, please send your request as plain text to info@clada-bg.eu. The service is free. It can be performed in two ways: either you provide the text and CLaDA-BG processes it, or you receive a customised version of the pipeline together with a short training on how to use it. The complete pipeline consists of a tokeniser, sentence splitter, named entity recogniser, PoS and morphosyntactic tagger, lemmatiser, dependency parser and semantic parser. These modules can be used together or separately.

In addition, the state-of-the-art CLASSLA pipeline processes standard Bulgarian on the levels of tokenisation, sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing and named entity recognition. For Bulgarian, the pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models for lemmatisation and part-of-speech tagging of standard Bulgarian are also available. The documentation for installing and using the pipeline is available here. You can try out the pipeline at the CLASSLA Annotation tool website.
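As a rough sketch of what local use of the CLASSLA pipeline can look like, the Python snippet below wraps the usual download, load and annotate steps in one function. It assumes that the classla package is installed from PyPI and that the Bulgarian models can be downloaded on first use; the function name annotate_bulgarian is hypothetical, and the pipeline documentation remains the authoritative reference for the available options.

```python
def annotate_bulgarian(text):
    """Annotate standard Bulgarian text and return the result as CoNLL-U."""
    # Imported lazily so this module loads even without the package installed.
    # Assumption: the `classla` package from PyPI is available.
    import classla

    classla.download("bg")        # fetch the standard Bulgarian models
    nlp = classla.Pipeline("bg")  # load the default processors for Bulgarian
    doc = nlp(text)
    return doc.to_conll()


# Example call (needs the package and network access for the model download):
# print(annotate_bulgarian("Това е книга."))
```

In real use, the model download and the Pipeline object would normally be created once and reused across texts, since loading the models is the slow part.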

Q2.2: How can I standardise my texts prior to further processing?

Currently there are no technologies available for standardising Bulgarian texts.

Q2.3: How can I annotate my texts for named entities?

Named entity recognition is provided by the CLASSLA pipeline, which also offers an off-the-shelf model for standard Bulgarian.

Alternatively, there is a grammar-based NER tool for Bulgarian that also relies on a gazetteer of names. To use it, please get in touch via info@clada-bg.eu.

Q2.4: How can I syntactically parse my texts?

You can syntactically parse Bulgarian texts in the following ways:


3. Datasets to train Bulgarian annotation tools

Q3.1: Where can I get word embeddings for Bulgarian?

Q3.2: What data is available for training a text normaliser for Bulgarian?

Currently there are no datasets available for training normalisers for Bulgarian.

Q3.3: What data is available for training a part-of-speech tagger for Bulgarian?

For training purposes you can use:

  • the BulTreeBank (BTB) corpus, which is part of Universal Dependencies;
  • pre-trained models for the TnT, SVMTool and ACOPOST taggers, which are available on the BulTreeBank website.
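Most tagger training tools expect sentences as sequences of (word, tag) pairs. The following Python sketch shows one way such pairs could be read from CoNLL-U text, as used by Universal Dependencies treebanks. The sample sentence is hand-written for illustration rather than taken from the BulTreeBank, and the function name read_tagged_sentences is hypothetical.

```python
def read_tagged_sentences(conllu_text):
    """Return a list of sentences, each a list of (form, UPOS) pairs."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        current.append((cols[1], cols[3]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative sentence ("This is a book."), not corpus data.
sample = "\n".join([
    "# text = Това е книга.",
    "1\tТова\tтова\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tе\tсъм\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\tкнига\tкнига\tNOUN\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])
tagged = read_tagged_sentences(sample)
print(tagged)
```

Reading column index 4 (XPOS) instead of index 3 would give the language-specific tags, and column index 2 gives the lemmas needed for lemmatiser training.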

Q3.4: What data is available for training a lemmatiser for Bulgarian?

You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.

Q3.5: What data is available for training a named entity recogniser for Bulgarian?

For training the named entity recogniser of standard language, the following resources can be used:

Q3.6: What data is available for training a syntactic parser for Bulgarian?

You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.
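For parser training, the relevant information in a Universal Dependencies file is the HEAD and DEPREL columns. As a minimal sketch, the Python code below extracts the dependency arcs of one sentence and checks that it has exactly one root, a basic sanity check before training. The sentence is hand-written for illustration, and the function names dependency_arcs and has_single_root are hypothetical.

```python
def dependency_arcs(sentence_lines):
    """Return (token id, head id, relation) triples for one CoNLL-U sentence."""
    arcs = []
    for line in sentence_lines:
        cols = line.split("\t")
        # Skip comments, multiword token ranges and empty nodes.
        if line.startswith("#") or not cols[0].isdigit():
            continue
        arcs.append((int(cols[0]), int(cols[6]), cols[7]))
    return arcs


def has_single_root(arcs):
    """A well-formed UD tree has exactly one token whose head is 0."""
    return sum(1 for _, head, _ in arcs if head == 0) == 1


# Hand-written illustrative sentence ("This is a book."), not corpus data.
sentence = [
    "# text = Това е книга.",
    "1\tТова\tтова\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tе\tсъм\tAUX\t_\t_\t3\tcop\t_\t_",
    "3\tкнига\tкнига\tNOUN\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
]
arcs = dependency_arcs(sentence)
print(arcs, has_single_root(arcs))
```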