{"id":4101,"date":"2019-08-02T14:08:45","date_gmt":"2019-08-02T14:08:45","guid":{"rendered":"http:\/\/www.clarin.si\/info\/?page_id=4101"},"modified":"2024-12-19T14:29:25","modified_gmt":"2024-12-19T14:29:25","slug":"faq4bulgarian","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/faq4bulgarian\/","title":{"rendered":"FAQ for Bulgarian language resources and technologies"},"content":{"rendered":"<p>This FAQ is part of the documentation of the\u00a0<a href=\"https:\/\/www.clarin.si\/info\/k-center\/\">CLASSLA<\/a> CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at <a href=\"mailto:helpdesk.classla@clarin.si?subject=FAQ_Bulgarian\">helpdesk.classla@clarin.si<\/a>, with the subject \u201cFAQ_Bulgarian\u201d.<\/p>\n<p>The questions in this FAQ are organised into three main sections:<\/p>\n\n<h2>1. Online Bulgarian language resources<\/h2>\n<p>Disclaimer: Please note that CLaDA-BG has been in operation since 2018. During the second phase, 2019-2020, a dedicated repository hosting language resources and tools for Bulgarian is planned to be put into operation. It will then become possible to deposit these resources and tools with CLARIN-ERIC.<\/p>\n<h4 id=\"q11\">Q1.1: Where can I find Bulgarian dictionaries?<\/h4>\n<p>The main dictionary portals offered by CLaDA-BG partners or supported by CLaDA-BG are the following:<\/p>\n<ul>\n<li>the <a href=\"https:\/\/clada-bg.eu\/en\/centers-and-services\/language-technologies\/btb-wordnet.html\" target=\"_blank\" rel=\"noopener\">Bulgarian semantic WordNet lexicon BTB-WordNet<\/a>;<\/li>\n<li>a <a href=\"https:\/\/hdl.handle.net\/21.11129\/0000-000B-D386-F\" target=\"_blank\" rel=\"noopener\">specialized lexicon for Bulgarian and other languages<\/a>, which can be used for translating IT-domain expressions. 
The lexicons for Basque, Bulgarian, Czech, Dutch, English, Portuguese and Spanish were developed in the QTLeap project and can be downloaded from the <a href=\"https:\/\/portulanclarin.net\/repository\/search\/\" target=\"_blank\" rel=\"noopener\">PORTULAN CLARIN<\/a> repository.<\/li>\n<\/ul>\n<p>Dictionaries by other providers:<\/p>\n<p>The Institute for Bulgarian Language at the Bulgarian Academy of Sciences has provided a number of on-line dictionaries:<\/p>\n<ul>\n<li><a href=\"https:\/\/dcl.bas.bg\/bulnet\/\" target=\"_blank\" rel=\"noopener\">BulNet<\/a><\/li>\n<li><a href=\"https:\/\/ibl.bas.bg\/rbe\/\" target=\"_blank\" rel=\"noopener\">Dictionary of Bulgarian Language<\/a><\/li>\n<li><a href=\"https:\/\/ibl.bas.bg\/infolex\/synonyms.php\" target=\"_blank\" rel=\"noopener\">Bulgarian Synonyms<\/a><\/li>\n<li><a href=\"https:\/\/ibl.bas.bg\/infolex\/antonyms.php\" target=\"_blank\" rel=\"noopener\">Bulgarian Antonyms<\/a><\/li>\n<li><a href=\"https:\/\/ibl.bas.bg\/infolex\/idioms.php\" target=\"_blank\" rel=\"noopener\">Bulgarian Phraseology<\/a><\/li>\n<li><a href=\"https:\/\/ibl.bas.bg\/infolex\/neologisms.php\" target=\"_blank\" rel=\"noopener\">Bulgarian Neologisms<\/a><\/li>\n<li><a href=\"https:\/\/www.bgspeech.net\/\" target=\"_blank\" rel=\"noopener\">Bulgarian Speech Corpora<\/a><\/li>\n<\/ul>\n<h4>Q1.2: How can I analyse Bulgarian corpora online?<\/h4>\n<p>Bulgarian corpora can be analysed on-line through the following platforms:<\/p>\n<ul>\n<li><a href=\"http:\/\/webclark.org\/\" target=\"_blank\" rel=\"noopener\">WebClark concordancer<\/a> by Institute of Information and Communication Technologies;<\/li>\n<li>the <a href=\"http:\/\/search.dcl.bas.bg\/\">Concordancer<\/a> by the <a href=\"https:\/\/ibl.bas.bg\/en\/resursi\/\" target=\"_blank\" rel=\"noopener\">Institute for Bulgarian Language<\/a> at the Bulgarian Academy of Sciences;<\/li>\n<li>some Bulgarian corpora are available through the <a 
href=\"https:\/\/www.clarin.si\/info\/concordances\/\" target=\"_blank\" rel=\"noopener\">CLARIN.SI concordancers<\/a>, i.e. <a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Crystal noSketch Engine<\/a> (an <a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener\">open version without log-in<\/a> and a <a href=\"https:\/\/www.clarin.si\/skelog\" target=\"_blank\" rel=\"noopener\">version with log-in<\/a> which allows subcorpus creation and personalised display of e.g. corpus attributes), <a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Bonito noSketch Engine<\/a> and\u00a0<a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>, which share the same set of corpora and back-end, but have different front-ends;<\/li>\n<li>the commercial <a href=\"https:\/\/www.sketchengine.eu\/\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a> also offers access to several <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/bulgarian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">Bulgarian language corpora<\/a>, as well as some additional tools, including the tools to analyse <span class=\"bluet_tooltip tooltipy-kw tooltipy-kw-3920\" data-tooltip=\"3920\">collocation<\/span>s (<a href=\"https:\/\/www.sketchengine.eu\/guide\/word-sketch-collocations-and-word-combinations\/\" target=\"_blank\" rel=\"noopener\">Word sketches<\/a>), synonyms and antonyms (<a href=\"https:\/\/www.sketchengine.eu\/guide\/thesaurus-synonyms-antonyms-similar-words\/\" target=\"_blank\" rel=\"noopener\">Thesaurus<\/a>), and the tools to compute frequency lists of multiword expressions (<a href=\"https:\/\/www.sketchengine.eu\/guide\/n-grams-multiword-expressions\/\" target=\"_blank\" rel=\"noopener\">N-grams<\/a>). 
It also allows users to create their own corpora.<\/li>\n<\/ul>\n<h4>Q1.3: Which Bulgarian corpora can I analyse online?<\/h4>\n<p>Below we list the main corpora portals offered by CLaDA-BG partners or supported by CLaDA-BG:<\/p>\n<ul>\n<li><a href=\"http:\/\/webclark.org\/\" target=\"_blank\" rel=\"noopener\">Bulgarian Reference Corpus<\/a><\/li>\n<li><a href=\"http:\/\/political.webclark.org\/\" target=\"_blank\" rel=\"noopener\">Corpus of Bulgarian and Journalistic Speech<\/a><\/li>\n<li><a href=\"http:\/\/dar.webclark.org\/\" target=\"_blank\" rel=\"noopener\">Corpus of Culture for Giving for Education (CoDAR)<\/a><\/li>\n<li><a href=\"https:\/\/korpus.juls.savba.sk\/skbg_en.html\" target=\"_blank\" rel=\"noopener\">Bilingual Bulgarian-Slovak parallel corpus<\/a><\/li>\n<\/ul>\n<p>Corpora by other providers:<\/p>\n<ul>\n<li><a href=\"http:\/\/search.dcl.bas.bg\" target=\"_blank\" rel=\"noopener\">Bulgarian National Corpus (BNC)<\/a><\/li>\n<li>the following Bulgarian corpora are available under <a href=\"https:\/\/www.clarin.si\/info\/concordances\/\" target=\"_blank\" rel=\"noopener\">CLARIN.SI concordancers<\/a> (<a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener noreferrer\">Crystal noSkE<\/a>, <a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener noreferrer\">Bonito noSkE<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>): the web corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_bg\" target=\"_blank\" rel=\"noopener\">CLASSLA-web.bg<\/a> (3.9 billion tokens), the parliamentary corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_bg\" target=\"_blank\" rel=\"noopener\">ParlaMint-BG<\/a>, the Wikipedia corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_bg\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-bg<\/a>, the Bulgarian part of the multilingual parallel corpus 
of the EU law translation memory <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=dgtud_bg\" target=\"_blank\" rel=\"noopener\">EU DGT-UD<\/a>, and the Bulgarian part of multilingual European parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX<\/a>, paired with the machine-translated English corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx_en\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX-en<\/a><\/li>\n<li>the commercial <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>\u00a0includes the <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/bulgarian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">following Bulgarian corpora<\/a>: <a href=\"https:\/\/www.sketchengine.eu\/eurlex-corpus\/\" target=\"_blank\" rel=\"noopener\">EUR-Lex Bulgarian 2\/2016<\/a>, <a href=\"https:\/\/www.sketchengine.eu\/opus-parallel-corpora\/\" target=\"_blank\" rel=\"noopener\">OPUS2 Bulgarian<\/a>, which is a part of the parallel corpus of 40 languages, parallel corpus <a href=\"https:\/\/www.sketchengine.eu\/europarl-parallel-corpus\/\" target=\"_blank\" rel=\"noopener\">EUROPARL7<\/a>, created from the European Parliament Proceedings, and the Bulgarian Web 2012 corpus <a href=\"https:\/\/www.sketchengine.eu\/bgtenten-bulgarian-corpus\/\" target=\"_blank\" rel=\"noopener\">bgTenTen12<\/a>, cf. Q1.2.<\/li>\n<\/ul>\n<h4>Q1.4: What linguistic annotation schemas are used in Bulgarian corpora?<\/h4>\n<p>For word-level morphosyntactic annotation, most corpora use the BulTreeBank tagset, which is based on the <a href=\"https:\/\/nl.ijs.si\/ME\/V6\/msd\/html\/msd-bg.html\" target=\"_blank\" rel=\"noopener\">Bulgarian MULTEXT-East tagset<\/a>. 
The description can be found in the reports <a href=\"http:\/\/bultreebank.org\/wp-content\/uploads\/2017\/04\/BTB-TR03.pdf\" target=\"_blank\" rel=\"noopener\">BTB-TR03<\/a> and <a href=\"http:\/\/bultreebank.org\/wp-content\/uploads\/2017\/04\/BTB-TR04.pdf\" target=\"_blank\" rel=\"noopener\">BTB-TR04<\/a> of the <a href=\"http:\/\/bultreebank.org\/en\/\" target=\"_blank\" rel=\"noopener\">BulTreeBank project<\/a>.<\/p>\n<p>For syntactic annotation, two tagsets are used: the <a href=\"http:\/\/bultreebank.org\/wp-content\/uploads\/2017\/04\/BTB-TR05.pdf\" target=\"_blank\" rel=\"noopener\">HPSG-based one<\/a> and the one defined by the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies project<\/a>. The <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies project<\/a> also defines a <a href=\"https:\/\/universaldependencies.org\/u\/pos\/index.html\" target=\"_blank\" rel=\"noopener\">feature set for annotating morphosyntax<\/a>.<\/p>\n<h4>Q1.5: Where can I download Bulgarian resources?<\/h4>\n<p>Bulgarian resources can be downloaded from several places:<\/p>\n<ul>\n<li><a href=\"https:\/\/lindat.mff.cuni.cz\/repository\/xmlui\/\" target=\"_blank\" rel=\"noopener\">LINDAT\/CLARIN repository<\/a><\/li>\n<li><a href=\"https:\/\/portulanclarin.net\/repository\/search\/?q=&amp;selected_facets=languageNameFilter_exact%3ABulgarian\" target=\"_blank\" rel=\"noopener\">PORTULAN CLARIN repository<\/a><\/li>\n<li><a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/search\/?q=&amp;selected_facets=languageNameFilter_exact%3ABulgarian\" target=\"_blank\" rel=\"noopener\">MetaShare repository<\/a><\/li>\n<li><a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies webpage<\/a><\/li>\n<li><a 
href=\"https:\/\/www.clarin.si\/repository\/xmlui\/discover?filtertype=language&amp;filter_relational_operator=equals&amp;filter=Bulgarian\" target=\"_blank\" rel=\"noopener\">CLARIN.SI repository<\/a><\/li>\n<\/ul>\n<p>In addition to the resources mentioned above and below, the <a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/discover?filtertype=language&amp;filter_relational_operator=equals&amp;filter=Bulgarian\" target=\"_blank\" rel=\"noopener\">CLARIN.SI repository<\/a> offers:<\/p>\n<ul>\n<li><em>manually annotated corpora and datasets<\/em>, including the <a href=\"http:\/\/hdl.handle.net\/11356\/1441\" target=\"_blank\" rel=\"noopener\">Annotated Corpus of Pre-Standardized Balkan Slavic Literature<\/a>, and the parallel corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1043\" target=\"_blank\" rel=\"noopener\">MULTEXT-East \u201c1984\u201d<\/a>, annotated with morphosyntactic descriptions (PoS tags) and lemmas.<\/li>\n<li><em>other corpora and datasets<\/em>, including the web corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1800\" target=\"_blank\" rel=\"noopener\">MaCoCu-bg<\/a> with 3.5 billion words, also available as a genre-enriched version inside the <a href=\"http:\/\/hdl.handle.net\/11356\/1969\" target=\"_blank\" rel=\"noopener\">MaCoCu-Genre<\/a> corpus collection, the Bulgarian-English parallel corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1815\" target=\"_blank\" rel=\"noopener\">MaCoCu-bg-en<\/a>, the linguistically annotated corpus of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1911\" target=\"_blank\" rel=\"noopener\">ParlaMint.ana<\/a>, the sentiment annotated <a href=\"http:\/\/hdl.handle.net\/11356\/1054\" target=\"_blank\" rel=\"noopener\">Twitter corpus<\/a>, the parallel sense-annotated corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1842\" target=\"_blank\" rel=\"noopener\">ELEXIS-WSD<\/a>, the concreteness and imageability lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1187\" target=\"_blank\" 
rel=\"noopener\">MEGA.HR-Crossling<\/a>, and a\u00a0<a href=\"http:\/\/hdl.handle.net\/11356\/1048\" target=\"_blank\" rel=\"noopener\">lexicon of emoji characters<\/a> with automatically assigned sentiment.<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2>2. Tools to annotate Bulgarian texts<\/h2>\n<h4>Q2.1: How can I perform basic linguistic processing of my Bulgarian texts?<\/h4>\n<p>The <a href=\"https:\/\/ufal.mff.cuni.cz\/udpipe\/models\" target=\"_blank\" rel=\"noopener\">UDPipe tool<\/a> has a module for Bulgarian, which performs tokenisation, morphosyntactic annotation and lemmatisation (as well as dependency parsing).<\/p>\n<p>The well-known <a href=\"https:\/\/www.cis.uni-muenchen.de\/~schmid\/tools\/TreeTagger\/\" target=\"_blank\" rel=\"noopener\">TreeTagger<\/a> also offers a module for tagging Bulgarian.<\/p>\n<p>The best results, including further annotation levels, are currently achieved with the Bulgarian Linguistic Pipe based on the <a href=\"http:\/\/bultreebank.org\/bg\/clark\/\" target=\"_blank\" rel=\"noopener\">CLaRK system<\/a> and its related trained models. The Pipe will be made publicly available shortly. The CLaRK system can be downloaded and customised locally for various tasks. The site also contains a manual and demo examples. The CLaRK system provides a built-in tokeniser; however, the morphosyntactic tagger and lemmatiser models are not available on-line.<\/p>\n<p>To annotate your texts for these levels, please send your request as plain text to <a href=\"mailto:info@clada-bg.eu\">info@clada-bg.eu<\/a>. The service is free. It can be performed in two ways: either you send us the text and CLaDA-BG processes it, or we provide a customized version of the pipeline together with short training on how to use it. The complete pipeline consists of a tokeniser, sentence splitter, named entity recogniser, PoS and morphosyntactic tagger, lemmatiser, dependency parser and semantic parser. 
These modules can be used together or separately.<\/p>\n<p>In addition, the state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> processes standard Bulgarian on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Bulgarian, the CLASSLA-Stanza pipeline uses the rule-based <a href=\"https:\/\/github.com\/clarinsi\/reldi-tokeniser\" target=\"_blank\" rel=\"noopener\">reldi-tokeniser<\/a>. Off-the-shelf models are also available for <a href=\"http:\/\/hdl.handle.net\/11356\/1850\" target=\"_blank\" rel=\"noopener\">lemmatisation<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1849\" target=\"_blank\" rel=\"noopener\">part-of-speech tagging<\/a> of standard Bulgarian. The documentation for the installation and use of the pipeline is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md\" target=\"_blank\" rel=\"noopener\">here<\/a>. You can try out the pipeline at the <a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\">CLASSLA Annotation tool<\/a> website.<\/p>\n<h4>Q2.2: How can I standardize my texts prior to further processing?<\/h4>\n<p>Currently there are no technologies available for standardizing texts in Bulgarian.<\/p>\n<h4>Q2.3: How can I annotate my texts for named entities?<\/h4>\n<p>Named entity recognition is provided by the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers an <a href=\"http:\/\/hdl.handle.net\/11356\/1329\" target=\"_blank\" rel=\"noopener\">off-the-shelf model<\/a> for standard Bulgarian.<\/p>\n<p>Alternatively, there is a grammar-based NER tool for Bulgarian that also relies on a gazetteer of names. 
To use it, please get in touch via <a href=\"mailto:info@clada-bg.eu\">info@clada-bg.eu<\/a>.<\/p>\n<h4>Q2.4: How can I syntactically parse my texts?<\/h4>\n<p>You can syntactically parse Bulgarian texts in the following ways:<\/p>\n<ul>\n<li>by using the state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> (<a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>), which also offers an <a href=\"http:\/\/hdl.handle.net\/11356\/1851\" target=\"_blank\" rel=\"noopener\">off-the-shelf model<\/a>;<\/li>\n<li>by using the offline Bulgarian-specific dependency parser, which is part of the <a href=\"http:\/\/bultreebank.org\/bg\/clark\/\" target=\"_blank\" rel=\"noopener\">CLaRK system<\/a>, cf. Q2.1. To use it, please get in touch via <a href=\"mailto:info@clada-bg.eu\">info@clada-bg.eu<\/a>;<\/li>\n<li>by using the <a href=\"https:\/\/ufal.mff.cuni.cz\/udpipe\" target=\"_blank\" rel=\"noopener\">UDPipe tool<\/a>, which has off-the-shelf models for many languages, Bulgarian included, and uses the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>.<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2>3. 
Datasets to train Bulgarian annotation tools<\/h2>\n<h4>Q3.1: Where can I get word embeddings for Bulgarian?<\/h4>\n<ul>\n<li>Embeddings trained on the MaCoCu-bg web corpus (4 billion tokens) are available as the <a href=\"http:\/\/hdl.handle.net\/11356\/1796\" target=\"_blank\" rel=\"noopener\">CLARIN.SI-embed.bg<\/a> embedding collection.<\/li>\n<li>You can also download embeddings from the <a href=\"https:\/\/fasttext.cc\/docs\/en\/crawl-vectors.html\" target=\"_blank\" rel=\"noopener\">FastText webpage<\/a> or the <a href=\"http:\/\/hdl.handle.net\/11234\/1-1989\" target=\"_blank\" rel=\"noopener\">CoNLL2017 word embeddings<\/a>.<\/li>\n<\/ul>\n<h4>Q3.2: <span class=\"lwptoc_item_label\">What data is available for training a text normaliser for Bulgarian?<\/span><\/h4>\n<p>Currently there are no datasets available for training normalisers for Bulgarian.<\/p>\n<h4>Q3.3: What data is available for training a part-of-speech tagger for Bulgarian?<\/h4>\n<p>For training purposes you can use:<\/p>\n<ul>\n<li>the BulTreeBank (BTB) corpus, which is part of the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies<\/a> project;<\/li>\n<li>pre-trained models for the TnT, SVMtool and Acpost taggers, which are available on the <a href=\"http:\/\/bultreebank.org\/en\/resources\/part-speech-tagging-bultreebank-bulgarian-taggers\/\" target=\"_blank\" rel=\"noopener\">BulTreeBank<\/a> website.<\/li>\n<\/ul>\n<h4>Q3.4: What data is available for training a lemmatiser for Bulgarian?<\/h4>\n<p>You can use the BulTreeBank (BTB), which is part of the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies<\/a> project.<\/p>\n<h4>Q3.5: What data is available for training a named entity recogniser for Bulgarian?<\/h4>\n<p>For training a named entity recogniser for the standard language, the following resources can be used:<\/p>\n<ul>\n<li>the BulTreeBank (BTB) in the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" 
rel=\"noopener\">Universal Dependencies<\/a> project;<\/li>\n<li>the specially designed Corpus of Named Entities from the <a href=\"http:\/\/bsnlp.cs.helsinki.fi\/bsnlp-2019\/shared_task.html\" target=\"_blank\" rel=\"noopener\">Shared task on Balto-Slavic languages 2019<\/a>;<\/li>\n<li>the Bulgarian part of the <a href=\"https:\/\/hdl.handle.net\/21.11129\/0000-000B-D36A-0\" target=\"_blank\" rel=\"noopener\">Multilingual WSD\/NER corpus<\/a> from the PORTULAN CLARIN repository.<\/li>\n<\/ul>\n<h4>Q3.6: What data is available for training a syntactic parser for Bulgarian?<\/h4>\n<p>You can use the BulTreeBank (BTB), which is part of the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies<\/a> project.<\/p>\n\n<p>&nbsp;<\/p>\n<div id=\"themify_builder_content-4101\" data-postid=\"4101\" class=\"themify_builder_content themify_builder_content-4101 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>This FAQ is part of the documentation of the\u00a0CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know on helpdesk.classla@clarin.si, Subject \u201cFAQ_Bulgarian\u201d. The questions in this FAQ are organised into three main sections: 1. 
Online Bulgarian language resources Disclaimer: Please note that CLaDA-BG is in [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-4101","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/4101","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=4101"}],"version-history":[{"count":60,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/4101\/revisions"}],"predecessor-version":[{"id":7836,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/4101\/revisions\/7836"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=4101"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}