Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

FAQ for Bulgarian language resources and technologies

This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at helpdesk.classla@clarin.si, using the subject “FAQ_Bulgarian”.

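The questions in this FAQ are organised into three main sections: online Bulgarian language resources, tools to annotate Bulgarian texts, and datasets to train Bulgarian annotation tools.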

1. Online Bulgarian language resources

Disclaimer: Please note that CLaDA-BG has been in operation since 2018. During the second phase, 2019-2020, a dedicated repository hosting language resources and tools for Bulgarian is planned to go into operation. It will then become possible to deposit these resources and tools with CLARIN ERIC.

Q1.1: Where can I find Bulgarian dictionaries?

The main dictionary portals offered by CLaDA-BG partners or supported by CLaDA-BG are the following:

Dictionaries by other providers:

The Institute for Bulgarian Language at the Bulgarian Academy of Sciences has provided a number of on-line dictionaries:

Q1.2: How can I analyse Bulgarian corpora online?

Bulgarian corpora can be analysed on-line through the following platforms:

Q1.3: Which Bulgarian corpora can I analyse online?

Below we list the main corpora portals offered by CLaDA-BG partners or supported by CLaDA-BG:

Corpora by other providers:

Q1.4: What linguistic annotation schemas are used in Bulgarian corpora?

For word-level morphosyntactic annotation, most corpora use the BulTreeBank tagset, which is based on the Bulgarian MULTEXT-East tagset. Its description can be found in the reports BTB-TR03 and BTB-TR04 of the BulTreeBank project.

For syntactic annotation, two tagsets are used: the HPSG-based one and that of the Universal Dependencies project. The Universal Dependencies project also provides a feature set for annotating morphosyntax.
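As a small illustration of the Universal Dependencies feature set, the sketch below splits a feature string of the kind found in the FEATS column of a CoNLL-U file into attribute-value pairs. The example string is hand-written for illustration and is not taken from any particular corpus; the function name `parse_feats` is our own.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Split a CoNLL-U FEATS string into a {feature: value} dictionary."""
    if feats in ("", "_"):  # "_" marks an empty feature set in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hand-written example, e.g. for an indefinite feminine singular noun.
example = "Definite=Ind|Gender=Fem|Number=Sing"
print(parse_feats(example))
```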

Q1.5: Where can I download Bulgarian resources?

Bulgarian resources can be downloaded from several places:


2. Tools to annotate Bulgarian texts

Q2.1: How can I perform basic linguistic processing of my Bulgarian texts?

The UDPipe tool has a module for Bulgarian, which performs tokenization, morphosyntactic annotation, lemmatization and dependency parsing.
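UDPipe can write its annotations in the CoNLL-U format (ten tab-separated columns per token). The sketch below shows one way such output could be read in Python using only the standard library. The sample sentence and its annotation are hand-written for illustration and are not actual UDPipe output; the names `SAMPLE` and `read_conllu` are our own.

```python
# Hand-written CoNLL-U sample (illustrative only, not real tool output).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# text = Аз чета книга.\n"
    "1\tАз\tаз\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tчета\tчета\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tкнига\tкнига\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)


def read_conllu(text):
    """Yield each sentence as a list of token dictionaries."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # comment lines such as "# text = ..."
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():     # skip ranges ("1-2") and empty nodes ("1.1")
            continue
        sentence.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if sentence:                      # flush the last sentence
        yield sentence


for sent in read_conllu(SAMPLE):
    for tok in sent:
        head = "ROOT" if tok["head"] == 0 else sent[tok["head"] - 1]["form"]
        print(f'{tok["form"]}\t{tok["lemma"]}\t{tok["upos"]}\t{tok["deprel"]} -> {head}')
```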

The well-known TreeTagger also offers a module for tagging Bulgarian.

The best results, including further annotation levels, are currently achieved with the Bulgarian Linguistic Pipe, which is based on the CLaRK system and its related trained models. The Pipe will be made publicly available shortly. The CLaRK system itself can be downloaded and customised locally for various tasks; its site also contains a manual and demo examples. CLaRK provides a built-in tokenizer, but the morphosyntactic tagger and lemmatiser models are not available online.

To annotate your texts for these levels, please send your request as plain text to info@clada-bg.eu. The service is free and can be provided in two ways: either you send us the text and CLaDA-BG processes it, or you receive a customized version of the pipeline together with a short training on how to use it. The complete pipeline consists of a tokenizer, sentence splitter, named entity recognizer, PoS and morphosyntactic tagger, lemmatiser, dependency parser and semantic parser. These modules can be used together or separately.

Q2.2: How can I standardize my texts prior to further processing?

Currently there are no technologies available for standardizing texts in Bulgarian.

Q2.3: How can I annotate my texts for named entities?

Currently the only NER tool for Bulgarian is grammar-based and also relies on a gazetteer of names. To use it, please get in touch via info@clada-bg.eu.

Q2.4: How can I syntactically parse my texts?

You can syntactically parse Bulgarian texts in the following ways:


3. Datasets to train Bulgarian annotation tools

Q3.1: Where can I get word embeddings for Bulgarian?

Embeddings are available from the FastText webpage.
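FastText distributes vectors in a plain-text format: a header line with the vocabulary size and the dimension, followed by one line per word with its vector values. The sketch below reads such data and compares words by cosine similarity. It runs on a toy three-dimensional sample with made-up values; the file name in the closing comment (`cc.bg.300.vec`) is an assumption about how the downloaded Bulgarian file is named, and `load_vec` and `cosine` are our own helper names.

```python
import io
import math

# Toy data in the fastText text format (values are invented for illustration).
TOY = "3 3\nкуче 1.0 0.0 0.0\nкотка 0.9 0.1 0.0\nкнига 0.0 0.0 1.0\n"


def load_vec(handle, limit=None):
    """Read 'word v1 v2 ...' lines after a 'count dim' header into a dict."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:   # optionally keep only the first words
            break
        parts = line.rstrip().split(" ")
        values = [float(x) for x in parts[1:dim + 1]]
        if len(values) == dim:                 # ignore malformed lines
            vectors[parts[0]] = values
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vecs = load_vec(io.StringIO(TOY))
print(cosine(vecs["куче"], vecs["котка"]))
print(cosine(vecs["куче"], vecs["книга"]))

# With a downloaded file (file name assumed), loading could look like:
# with open("cc.bg.300.vec", encoding="utf-8") as f:
#     vecs = load_vec(f, limit=50000)
```

The `limit` parameter is there because the full vocabulary is large; reading only the first (most frequent) words is often enough for exploration.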

Q3.2: What data is available for training a text normaliser for Bulgarian?

Currently there are no datasets available for training normalisers for Bulgarian.

Q3.3: What data is available for training a part-of-speech tagger for Bulgarian?

For training purposes you can use:

  • the BulTreeBank (BTB) corpus, which is part of Universal Dependencies;
  • pre-trained models for the TnT, SVMtool and Acpost taggers, which are available on the BulTreeBank website.
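Universal Dependencies treebanks are distributed as CoNLL-U files, so training data for a tagger can be derived by pairing each word form with one annotation column. The sketch below is one possible way to do this; the sample sentences are hand-written for illustration (not taken from the BTB treebank), the helper names are our own, and the one-token-per-line output is only an assumed target format, since each tagger documents its own input format.

```python
# Hand-written CoNLL-U sample (illustrative only, not from the BTB treebank).
SAMPLE = (
    "# text = Децата играят.\n"
    "1\tДецата\tдете\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tиграят\tиграя\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Кучето лае.\n"
    "1\tКучето\tкуче\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tлае\tлая\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

LEMMA, UPOS = 2, 3  # zero-based CoNLL-U column indices


def training_pairs(conllu_text, column=UPOS):
    """Return sentences as lists of (form, label) pairs from one column."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():                 # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):       # skip comment lines
            cols = line.split("\t")
            if cols[0].isdigit():            # skip ranges and empty nodes
                current.append((cols[1], cols[column]))
    if current:
        sentences.append(current)
    return sentences


def to_tagger_format(sentences):
    """Serialise as 'form<TAB>label' lines with a blank line between sentences."""
    blocks = ("\n".join(f"{form}\t{label}" for form, label in sent) for sent in sentences)
    return "\n\n".join(blocks) + "\n"


print(to_tagger_format(training_pairs(SAMPLE)))
```

Passing `column=LEMMA` instead yields form-lemma pairs of the kind a lemmatiser would be trained on.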

Q3.4: What data is available for training a lemmatiser for Bulgarian?

You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.

Q3.5: What data is available for training a named entity recogniser for Bulgarian?

To train a named entity recognizer for the standard language, the following resources can be used:

Q3.6: What data is available for training a syntactic parser for Bulgarian?

You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.


CLARIN.SI IS SUPPORTED BY THE MINISTRY OF EDUCATION, SCIENCE AND SPORT UNDER THE PROGRAMME OF "EUROPEAN RESEARCH INFRASTRUCTURES".
Jožef Stefan Institute, 2014-2020. Your use of the CLARIN.SI website is subject to the CC BY License and our terms of use.