This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Bulgarian”.
The questions in this FAQ are organised into three main sections: online Bulgarian language resources, tools to annotate Bulgarian texts, and datasets to train Bulgarian annotation tools.
1. Online Bulgarian language resources
Disclaimer: Please note that CLaDA-BG has been in operation since 2018. During the second phase, 2019-2020, a dedicated repository hosting language resources and tools for Bulgarian is planned to go into operation. It will then become possible to deposit these resources and tools with CLARIN-ERIC.
Q1.1: Where can I find Bulgarian dictionaries?
The main dictionary portals offered by CLaDA-BG partners or supported by CLaDA-BG are the following:
- the Bulgarian semantic WordNet lexicon BTB-WordNet;
- a specialized lexicon for Bulgarian and other languages, which can be used for translating expressions specific to the IT domain. The lexicons for Basque, Bulgarian, Czech, Dutch, English, Portuguese and Spanish were developed in the QTLeap project and can be downloaded from the PORTULAN CLARIN repository.
Dictionaries by other providers:
The Institute for Bulgarian Language at the Bulgarian Academy of Sciences has provided a number of on-line dictionaries:
- BulNet
- Dictionary of Bulgarian Language
- Bulgarian Synonyms
- Bulgarian Antonyms
- Bulgarian Phraseology
- Bulgarian Neologisms
- Bulgarian Speech Corpora
Q1.2: How can I analyse Bulgarian corpora online?
Bulgarian corpora can be analysed on-line through the following platforms:
- the WebClark concordancer by the Institute of Information and Communication Technologies;
- the Concordancer by the Institute for Bulgarian Language at the Bulgarian Academy of Sciences;
- some Bulgarian corpora are available through the CLARIN.SI concordancers, i.e. CLARIN.SI Crystal noSketch Engine (an open version without log-in and a version with log-in which allows subcorpus creation and personalised display of e.g. corpus attributes), CLARIN.SI Bonito noSketch Engine and KonText, which share the same set of corpora and back-end, but have different front-ends;
- the commercial Sketch Engine also offers access to several Bulgarian language corpora, as well as some additional tools, including the tools to analyse collocations (Word sketches), synonyms and antonyms (Thesaurus), and the tools to compute frequency lists of multiword expressions (N-grams). It also allows users to create their own corpora.
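As a rough illustration of what an N-gram frequency list is, the following Python sketch counts contiguous word sequences in a tokenised text. It is not how Sketch Engine computes its lists; the function name `ngram_frequencies`, the sample sentence and the whitespace tokenisation are assumptions made for this example.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Naive whitespace tokenisation; a real corpus would come pre-tokenised.
text = "добър ден на всички и добър ден на вас"
bigrams = ngram_frequencies(text.split(), n=2)

# Print the three most frequent bigrams with their counts.
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Changing `n` to 3 would produce trigram counts in the same way.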
Q1.3: Which Bulgarian corpora can I analyse online?
Below we list the main corpora portals offered by CLaDA-BG partners or supported by CLaDA-BG:
- Bulgarian Reference Corpus
- Corpus of Bulgarian and Journalistic Speech
- Corpus of Culture for Giving for Education (CoDAR)
- Bilingual Bulgarian-Slovak parallel corpus
Corpora by other providers:
- Bulgarian National Corpus (BNC)
- the following Bulgarian corpora are available under CLARIN.SI concordancers (Crystal noSkE, Bonito noSkE and KonText): the web corpus CLASSLA-web.bg (3.9 billion tokens), the parliamentary corpus ParlaMint-BG, the Wikipedia corpus CLASSLAWiki-bg, the Bulgarian part of the multilingual parallel corpus of the EU law translation memory EU DGT-UD, and the Bulgarian part of multilingual European parliamentary corpora ParlaMint-XX, paired with the machine-translated English corpora ParlaMint-XX-en;
- the commercial Sketch Engine includes the following Bulgarian corpora: EUR-Lex Bulgarian 2/2016, OPUS2 Bulgarian, which is a part of the parallel corpus of 40 languages, parallel corpus EUROPARL7, created from the European Parliament Proceedings, and the Bulgarian Web 2012 corpus bgTenTen12, cf. Q1.2.
Q1.4: What linguistic annotation schemas are used in Bulgarian corpora?
For word-level morphosyntactic annotation, most corpora use the BulTreeBank tagset, which is based on the Bulgarian MULTEXT-East tagset. The description can be found in the reports BTB-TR03 and BTB-TR04 of the BulTreeBank project.
For syntactic annotation, two schemes are used: an HPSG-based one and the one defined by the Universal Dependencies project. Universal Dependencies also defines a feature set for annotating morphosyntax.
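In corpora distributed in the CoNLL-U format, Universal Dependencies morphological features are written as `Name=Value` pairs joined by `|`. The sketch below shows one way to turn such a string into a Python dictionary; the helper name `parse_feats` and the sample feature string are illustrative assumptions, not taken from a particular corpus.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Illustrative feature string for a definite feminine singular noun.
feats = parse_feats("Definite=Def|Gender=Fem|Number=Sing")
print(feats["Definite"], feats["Gender"], feats["Number"])
```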
Q1.5: Where can I download Bulgarian resources?
Bulgarian resources can be downloaded from several places:
- LINDAT/CLARIN repository
- PORTULAN CLARIN repository
- MetaShare repository
- Universal Dependencies webpage
- CLARIN.SI repository
In addition to the resources mentioned above and below, the CLARIN.SI repository offers:
- manually annotated corpora and datasets, including the Annotated Corpus of Pre-Standardized Balkan Slavic Literature, and the parallel corpus MULTEXT-East “1984”, annotated with morphosyntactic descriptions (PoS tags) and lemmas.
- other corpora and datasets, including the web corpus MaCoCu-bg with 3.5 billion words, also available as a genre-enriched version inside the MaCoCu-Genre corpus collection, the Bulgarian-English parallel corpus MaCoCu-bg-en, the linguistically annotated corpus of parliamentary debates ParlaMint.ana, the sentiment annotated Twitter corpus, the parallel sense-annotated corpus ELEXIS-WSD, the concreteness and imageability lexicon MEGA.HR-Crossling, and a lexicon of emoji characters with automatically assigned sentiment.
2. Tools to annotate Bulgarian texts
Q2.1: How can I perform basic linguistic processing of my Bulgarian texts?
The UDPipe tool has a model for Bulgarian, which performs tokenisation, morphosyntactic annotation and lemmatisation (as well as dependency parsing).
The well-known TreeTagger also offers a module for tagging Bulgarian.
The best results, including further annotation levels, are currently achieved with the Bulgarian Linguistic Pipe, which is based on the CLaRK system and its related trained models. The Pipe will be made publicly available shortly. The CLaRK system itself can be downloaded and customised locally for various tasks; its site also contains a manual and demo examples. CLaRK provides a built-in tokeniser, but the morphosyntactic tagger and lemmatiser models are not available on-line.
To annotate your texts for these levels, please send your request as plain text to info@clada-bg.eu. The service is free. It can be performed in two ways: either you provide the text and CLaDA-BG processes it, or you receive a customized version of the pipeline together with short training on how to use it. The complete pipeline consists of a tokeniser, sentence splitter, named entity recogniser, PoS and morphosyntactic tagger, lemmatiser, dependency parser and semantic parser. These modules can be used together or separately.
In addition, the state-of-the-art CLASSLA-Stanza pipeline processes standard Bulgarian on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Bulgarian, the pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models for lemmatisation and part-of-speech tagging of standard Bulgarian are also available. Documentation for installing and using the pipeline is available online, and you can try out the pipeline at the CLASSLA Annotation tool website.
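A minimal sketch of calling the pipeline from Python, assuming the `classla` package is installed (`pip install classla`) and the standard Bulgarian models can be downloaded under the language code `bg`. The processor list and the helper names `annotate_bulgarian` and `lemma_table` are choices made for this example; the pipeline documentation remains the authoritative source for the available options.

```python
def annotate_bulgarian(text, processors="tokenize,pos,lemma,depparse,ner"):
    """Annotate Bulgarian text and return the result as a CoNLL-U string."""
    import classla  # third-party package, imported only when this is called

    classla.download("bg")  # fetches the standard Bulgarian models
    nlp = classla.Pipeline("bg", processors=processors)
    return nlp(text).to_conll()


def lemma_table(conllu):
    """Extract (form, lemma, UPOS) triples from CoNLL-U output."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if cols[0].isdigit():  # skip multiword-token ranges and empty nodes
            rows.append((cols[1], cols[2], cols[3]))
    return rows
```

With the package installed, `lemma_table(annotate_bulgarian("Аз чета книга."))` would return one triple per token.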
Q2.2: How can I standardize my texts prior to further processing?
Currently there are no technologies available for standardizing texts in Bulgarian.
Q2.3: How can I annotate my texts for named entities?
Named entity recognition is provided by the CLASSLA-Stanza pipeline, which also offers an off-the-shelf model for standard Bulgarian.
Alternatively, there is a grammar-based NER tool for Bulgarian that also relies on a gazetteer of names. To use it, please get in touch via info@clada-bg.eu.
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Bulgarian texts in the following ways:
- by using the state-of-the-art CLASSLA-Stanza pipeline (Universal Dependencies formalism), which also offers an off-the-shelf model;
- by using the offline Bulgarian-specific dependency parser, which is part of the CLaRK system, cf. Q2.1. To use it, please get in touch via info@clada-bg.eu;
- by using the UDPipe tool which has off-the-shelf models for many languages, Bulgarian included, and uses the Universal Dependencies formalism.
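One way to try UDPipe without installing it locally is the public REST service hosted by LINDAT. The sketch below only uses the Python standard library. The endpoint URL, the parameter names and the model name `bulgarian` are assumptions based on the service's public documentation and should be checked there before use; `build_request` and `parse_text` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the public UDPipe REST service hosted by LINDAT.
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def build_request(text, model="bulgarian"):
    """Prepare a POST request asking for tokenising, tagging and parsing."""
    params = {"model": model, "tokenizer": "", "tagger": "", "parser": "", "data": text}
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(UDPIPE_URL, data=body)


def parse_text(text, model="bulgarian"):
    """Send the request and return the CoNLL-U string from the JSON reply."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        return json.load(response)["result"]
```

Calling `parse_text("Аз чета книга.")` requires network access and would return the parsed sentence in CoNLL-U format if the service accepts the assumed parameters.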
3. Datasets to train Bulgarian annotation tools
Q3.1: Where can I get word embeddings for Bulgarian?
- Embeddings trained on the MaCoCu-bg web corpus (4 billion tokens) are available as the CLARIN.SI-embed.bg embedding collection.
- You can also download embeddings from the FastText webpage or the CoNLL2017 word embeddings.
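Word embeddings are commonly distributed as plain text, with a header line giving the vocabulary size and vector dimension, followed by one word and its vector per line. The sketch below reads that layout and compares two words by cosine similarity. It assumes the downloaded file follows this word2vec-style text format, and it uses toy three-dimensional vectors in place of a real file; `load_vectors` and `cosine` are hypothetical helper names.

```python
import io
import math


def load_vectors(handle, limit=None):
    """Read word vectors from a word2vec-style text stream."""
    _count, dim = map(int, handle.readline().split())  # header: size, dimension
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy vectors standing in for a real embedding file.
sample = io.StringIO("3 3\nкотка 1.0 0.0 0.0\nкуче 0.8 0.6 0.0\nмаса 0.0 0.0 1.0\n")
vectors = load_vectors(sample)
print(cosine(vectors["котка"], vectors["куче"]))
```

For a real file, pass `open(path, encoding="utf-8")` instead of the in-memory sample, and consider setting `limit` to keep memory use manageable.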
Q3.2: What data is available for training a text normaliser for Bulgarian?
Currently there are no datasets available for training normalisers for Bulgarian.
Q3.3: What data is available for training a part-of-speech tagger for Bulgarian?
For training purposes you can use:
- the BulTreeBank (BTB) corpus that is part of the Universal Dependencies;
- pre-trained models for the TnT, SVMtool and Acpost taggers, which are available on the BulTreeBank website.
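Universal Dependencies treebanks are distributed in the CoNLL-U format, with one token per line and tab-separated columns. As a sketch of how such a file could be turned into tagger training data, the function below yields each sentence as a list of (form, UPOS) pairs. The helper name `read_tagged_sentences` and the two sample sentences are assumptions made for the example.

```python
def read_tagged_sentences(lines):
    """Yield each sentence as a list of (form, UPOS) pairs."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):  # skip comment lines
            cols = line.split("\t")
            if cols[0].isdigit():  # ignore multiword-token ranges such as "1-2"
                sentence.append((cols[1], cols[3]))
    if sentence:  # the input may lack a final blank line
        yield sentence


# Two hand-written sentences standing in for a treebank file.
sample = (
    "# text = Аз чета.\n"
    "1\tАз\tаз\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tчета\tчета\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Тя пише.\n"
    "1\tТя\tтя\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tпише\tпиша\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
sentences = list(read_tagged_sentences(sample.splitlines()))
print(len(sentences), sentences[0])
```

For a downloaded treebank, pass an open file handle (`open(path, encoding="utf-8")`) instead of the sample lines.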
Q3.4: What data is available for training a lemmatiser for Bulgarian?
You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.
Q3.5: What data is available for training a named entity recogniser for Bulgarian?
To train a named entity recogniser for standard Bulgarian, the following resources can be used:
- the BulTreeBank (BTB) in the Universal Dependencies;
- the specially designed Corpus of Named Entities from the Shared task on Balto-Slavic languages 2019;
- the Bulgarian part of the Multilingual WSD/NER corpus from PORTULAN CLARIN repository.
Q3.6: What data is available for training a syntactic parser for Bulgarian?
You can use the BulTreeBank (BTB) corpus, which is part of Universal Dependencies.