This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at email@example.com with the subject “FAQ_Bulgarian”.
The questions in this FAQ are organised into three main sections: online Bulgarian language resources, tools to annotate Bulgarian texts, and datasets to train Bulgarian annotation tools.
1. Online Bulgarian language resources
Disclaimer: Please note that CLaDA-BG has been in operation since 2018. During the second phase, 2019-2020, a dedicated repository hosting language resources and tools for Bulgarian is planned to go into operation. It will then become possible to deposit these resources and tools with CLARIN-ERIC.
Q1.1: Where can I find Bulgarian dictionaries?
The main dictionary portals offered by CLaDA-BG partners or supported by CLaDA-BG are the following:
- the Bulgarian semantic WordNet lexicon BTB-WordNet (v1.0), which can be found at the Open Multilingual WordNet portal;
- the Multilingual Verb Valence Lexicon, which gives information on Bulgarian verb valency (please note that the search works only with transliterated verbs);
- specialized lexicons for Bulgarian and other languages, which can be used for translating IT-domain expressions. The lexicons for Basque, Bulgarian, Czech, Dutch, English, Portuguese and Spanish were developed in the QTLeap project and can be downloaded from MetaShare.
Dictionaries by other providers:
The Institute for Bulgarian Language at the Bulgarian Academy of Sciences has provided a number of on-line dictionaries:
- Dictionary of Bulgarian Language
- Bulgarian Synonyms
- Bulgarian Antonyms
- Bulgarian Phraseology
- Bulgarian Neologisms
- Bulgarian Speech Corpora
Q1.2: How can I analyse Bulgarian corpora online?
Bulgarian corpora can be analysed on-line through the following platforms:
- the WebClark concordancer by the Institute of Information and Communication Technologies;
- the Concordancer by the Institute for Bulgarian Language at the Bulgarian Academy of Sciences;
- some Bulgarian corpora are available through the CLARIN.SI concordancers;
- the commercial Sketch Engine also offers access to several Bulgarian language corpora; for researchers from the EU, access to Sketch Engine is free for non-commercial purposes in 2018-2022.
Q1.3: Which Bulgarian corpora can I analyse online?
Below we list the main corpora portals offered by CLaDA-BG partners or supported by CLaDA-BG:
- Bulgarian Reference Corpus
- Corpus of Bulgarian and Journalistic Speech
- Corpus of Culture for Giving for Education (CoDAR)
- Bilingual Bulgarian-Slovak parallel corpus
Corpora by other providers:
- Bulgarian National Corpus (BNC)
- The Bulgarian part of the multilingual parallel corpus of the EU law translation memory EU DGT-UD
- Bulgarian Web 2012 bgTenTen12, on the Sketch Engine, cf. Q1.2
Q1.4: What linguistic annotation schemas are used in Bulgarian corpora?
For word-level morphosyntactic annotation most corpora use the BulTreeBank tagset, which is based on the Bulgarian MULTEXT-East tagset. The description can be found in the reports BTB-TR03 and BTB-TR04 of the BulTreeBank project.
For syntactic annotation, two annotation schemes are used: the HPSG-based one and that of the Universal Dependencies project. The Universal Dependencies project also defines a feature set for annotating morphosyntax.
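As an illustration of the Universal Dependencies feature set, morphosyntactic features are stored in CoNLL-U files as `Name=Value` pairs separated by `|`. The following is a minimal sketch of how such a value could be read into a dictionary; the function name `parse_feats` and the example feature string are assumptions made for illustration and are not taken from any particular corpus.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a CoNLL-U FEATS value such as 'Gender=Fem|Number=Sing' into a dict."""
    # An underscore marks an empty feature set in CoNLL-U.
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical feature string for a definite masculine singular noun.
example = "Definite=Def|Gender=Masc|Number=Sing"
features = parse_feats(example)
print(features)
```

A dictionary makes it easy to filter tokens by a single feature, for example all tokens whose `Number` is `Plur`.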
Q1.5: Where can I download Bulgarian resources?
Bulgarian Resources can be downloaded from several places:
2. Tools to annotate Bulgarian texts
Q2.1: How can I perform basic linguistic processing of my Bulgarian texts?
The UDPipe tool has a model for Bulgarian, which performs tokenization, morphosyntactic annotation and lemmatization (as well as dependency parsing).
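One possible way to try UDPipe without a local installation is through a web service. The sketch below assumes the public LINDAT UDPipe REST service, its `process` endpoint, and that the model can be selected with the name `bulgarian`; the endpoint URL, the model name and the helper names `build_request` and `annotate` are assumptions, so please check the service documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public LINDAT UDPipe service (not confirmed by this FAQ).
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def build_request(text: str, model: str = "bulgarian") -> urllib.request.Request:
    """Prepare a POST request asking for tokenization, tagging and parsing."""
    params = {
        "data": text,
        "model": model,   # assumed model name
        "tokenizer": "",  # empty value: run the component with default options
        "tagger": "",
        "parser": "",
    }
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(UDPIPE_URL, data=body)


def annotate(text: str, model: str = "bulgarian") -> str:
    """Send the request and return the annotated text (CoNLL-U by default)."""
    with urllib.request.urlopen(build_request(text, model), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["result"]
```

Calling `annotate("Котката спи.")` needs network access and is expected to return the analysis as a CoNLL-U string.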
The well-known TreeTagger also offers a module for tagging Bulgarian.
The best results, including further annotation levels, are currently achieved with the Bulgarian Linguistic Pipe, which is based on the CLaRK system and its related trained models. The Pipe will be made publicly available shortly. The CLaRK system itself can be downloaded and customised locally for various tasks, and its site also contains a manual and demo examples. The CLaRK system provides a built-in tokenizer; however, the morphosyntactic tagger and lemmatiser models are not available on-line.
To annotate your texts for these levels, please send your request as plain text to firstname.lastname@example.org. The service is free. It can be performed in two ways: either the text is provided to us and processed by CLaDA-BG, or a customized version of the pipeline is provided together with a short training on how to use it. The complete pipeline consists of a tokenizer, sentence splitter, named entity recognizer, PoS and morphosyntactic tagger, lemmatiser, dependency parser and semantic parser. These modules can be used together or separately.
Q2.2: How can I standardize my texts prior to further processing?
Currently there are no technologies available for standardizing texts in Bulgarian.
Q2.3: How can I annotate my texts for named entities?
Currently the only NER tool for Bulgarian is grammar-based and it also relies on a gazetteer of names. To use it, please get in touch via email@example.com.
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Bulgarian texts in the following ways:
- by using the offline Bulgarian-specific dependency parser that is part of the CLaRK system, cf. Q2.1. To use it, please get in touch via firstname.lastname@example.org;
- by using the UDPipe tool, which has off-the-shelf models for many languages, Bulgarian included, and uses the Universal Dependencies formalism.
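Parsers that follow the Universal Dependencies formalism typically produce CoNLL-U output, in which column 7 holds the index of the head and column 8 the dependency relation. The following is a minimal sketch of reading head-dependent pairs from a single parsed sentence; the sample sentence and its annotation are hand-written for illustration and are not actual parser output, and the function name `dependency_edges` is hypothetical.

```python
# Illustrative CoNLL-U fragment ("The cat sleeps."), written by hand.
SAMPLE = "\n".join([
    "# text = Котката спи.",
    "\t".join(["1", "Котката", "котка", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "спи", "спя", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])


def dependency_edges(conllu: str) -> list[tuple[str, str, str]]:
    """Return (head form, relation, dependent form) triples for one sentence."""
    forms = {0: "ROOT"}  # index 0 is the artificial root
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        forms[int(cols[0])] = cols[1]
        rows.append((int(cols[0]), int(cols[6]), cols[7]))
    return [(forms[head], deprel, forms[idx]) for idx, head, deprel in rows]


for head, relation, dependent in dependency_edges(SAMPLE):
    print(f"{head} --{relation}--> {dependent}")
```

The sketch handles one sentence at a time; a file with several sentences would first need to be split on blank lines.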
3. Datasets to train Bulgarian annotation tools
Q3.1: Where can I get word embeddings for Bulgarian?
Embeddings are available from the FastText webpage.
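FastText distributes its vectors, among other formats, as a plain-text `.vec` file: the first line gives the vocabulary size and the dimension, and each following line holds a word followed by its vector components. The sketch below reads this format using only the standard library and compares two words by cosine similarity. The toy two-dimensional vectors are invented for illustration, and the file name in the usage note is an assumption about the download page.

```python
import io
import math


def load_vectors(handle, limit=None):
    """Read word vectors in the text .vec format from an open file handle."""
    _n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # real files are large; allow loading only the first words
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy data standing in for a real embedding file (values are made up).
TOY = io.StringIO("3 2\nкотка 1.0 0.0\nкуче 0.8 0.6\nмаса 0.0 1.0\n")
vectors = load_vectors(TOY)
print(cosine(vectors["котка"], vectors["куче"]))
```

With a downloaded file, the handle could be opened as `open("cc.bg.300.vec", encoding="utf-8")`, where the file name is assumed, and `limit` keeps memory use manageable.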
Q3.2: What data is available for training a text normaliser for Bulgarian?
Currently there are no datasets available for training normalisers for Bulgarian.
Q3.3: What data is available for training a part-of-speech tagger for Bulgarian?
For training purposes you can use:
- the BulTreeBank (BTB) corpus that is part of the Universal Dependencies;
- pre-trained models for the TnT, SVMTool and ACOPOST taggers, which are available on the BulTreeBank website.
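To give an idea of how a treebank in CoNLL-U format can serve as tagger training data, the sketch below extracts (word form, UPOS) pairs and builds a most-frequent-tag baseline. This is a deliberately simple stand-in for a real tagger and not the method used by any of the tools above; the sample lines are hand-written, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hand-written CoNLL-U lines used in place of a real training file.
SAMPLE = "\n".join([
    "# text = Котката спи.",
    "\t".join(["1", "Котката", "котка", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "спи", "спя", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])


def read_tagged_tokens(conllu: str):
    """Yield (form, UPOS) pairs, skipping comments and multiword-token ranges."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if cols[0].isdigit():
            yield cols[1], cols[3]


def train_baseline(pairs) -> dict[str, str]:
    """Map each lowercased form to the tag it carries most often."""
    counts = defaultdict(Counter)
    for form, upos in pairs:
        counts[form.lower()][upos] += 1
    return {form: tags.most_common(1)[0][0] for form, tags in counts.items()}


model = train_baseline(read_tagged_tokens(SAMPLE))
print(model.get("котката", "X"))  # fall back to X for unseen forms
```

For real training, the string passed to `read_tagged_tokens` would be the content of the treebank's training file (in the Universal Dependencies release this is assumed to be named `bg_btb-ud-train.conllu`).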
Q3.4: What data is available for training a lemmatiser for Bulgarian?
You can use the BulTreeBank (BTB), which is part of the Universal Dependencies.
Q3.5: What data is available for training a named entity recogniser for Bulgarian?
For training a named entity recognizer for standard language, the following resources can be used:
- the BulTreeBank (BTB) in the Universal Dependencies;
- the specially designed Corpus of Named Entities from the Shared task on Balto-Slavic languages 2019;
- the Bulgarian part of the Multilingual WSD/NER corpus from MetaShare.
Q3.6: What data is available for training a syntactic parser for Bulgarian?
You can use the BulTreeBank (BTB), which is part of the Universal Dependencies.