Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

FAQ for Macedonian language resources and technologies

This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Macedonian”.

The questions in this FAQ are organised into three main sections:

1. Online Macedonian language resources

Q1.1: Where can I find Macedonian dictionaries?

Below is the list of the main lexical resources:

  • The Digital Macedonian dictionary (Дигитален речник на македонскиот јазик) is a multilingual dictionary of the Macedonian language with approximately 76,000 headwords and more than 120,000 word forms that are present in the embedded corpus. It supports searching for strings at the beginning, at the end, or anywhere within an entry, and strings can be typed in either the Cyrillic or the Latin alphabet. If the string is an existing headword, the dictionary shows it with some of its word forms and its English and Albanian translations. It also shows examples from the embedded corpus that illustrate the word's usage, as well as its synonyms, where they exist. The dictionary is protected by copyright. Some of the basic data can be used freely for educational and scientific purposes after obtaining written approval from SAM97 GmbH. The dictionary has a mobile version and a mobile application, which can be downloaded from Google Play.
  • The Reverse dictionary of the Macedonian language consists of approximately 150,000 entries, ordered by word ending. It is free to use.
  • The Acronym and abbreviation dictionary contains more than 2,000 acronyms, mainly administrative, and a few abbreviations. It is free to use.
  • Two verb conjugators, Fleximac and Vigna, are particularly useful for verb annotation, analysis and synthesis. Apart from 8 tenses and 4 verbal constructions, Vigna also presents the corresponding verbal noun, verbal adjectives and gerund.
  • mkLex is an inflectional lexicon of the Macedonian language, consisting of nearly 90,000 lexemes and 1,300,000 word-forms; it can be downloaded from the CLARIN.SI repository (file wfl-mk.txt.gz).
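
As an illustration of how an inflectional lexicon such as mkLex could be used after download, the sketch below reads tab-separated entries into two indexes: one from word-form to its analyses and one from lemma to its word-forms. The three-column layout (word-form, lemma, morphosyntactic description) is an assumption about the file and should be checked against the actual download; the sample entries and their MSD values are illustrative only, and the function names are hypothetical.

```python
import gzip
import io
from collections import defaultdict


def load_lexicon(lines):
    """Index lexicon entries by word-form and by lemma.

    Assumes one entry per line with three tab-separated columns:
    word-form, lemma, MSD. This layout is an assumption, not a
    documented guarantee about wfl-mk.txt.gz.
    """
    by_form = defaultdict(set)
    by_lemma = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        form, lemma, msd = line.split("\t")
        by_form[form].add((lemma, msd))
        by_lemma[lemma].add(form)
    return dict(by_form), dict(by_lemma)


def load_lexicon_file(path):
    """Read a gzip-compressed, UTF-8 encoded lexicon file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return load_lexicon(handle)


# Tiny in-memory stand-in for the real file; MSD values are illustrative.
SAMPLE = (
    "книга\tкнига\tNcfsnn\n"
    "книги\tкнига\tNcfpnn\n"
    "книгата\tкнига\tNcfsny\n"
    "град\tград\tNcmsnn\n"
    "градови\tград\tNcmpnn\n"
)

by_form, by_lemma = load_lexicon(io.StringIO(SAMPLE))
print(sorted(by_lemma["книга"]))  # all listed word-forms of one lexeme
print(by_form["градови"])         # analyses listed for one word-form
```

With the real file, `load_lexicon_file("wfl-mk.txt.gz")` would return the same two indexes, provided the column layout matches the assumption above.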

Q1.2: How can I analyse Macedonian corpora online?

There are currently no specific platforms for searching through Macedonian corpora.

Q1.3: Which Macedonian corpora can I analyse online?

There are several corpora or digital text collections, which can be searched:

  • The Macedonian Academy of Sciences and Arts has a large collection of Digital Resources, which includes old books and dictionaries, the historical development of the Macedonian language, digital resources for Macedonian dialects, grammars, studies, comparative Slavic and Balkan studies, a Macedonian language bibliography and an Electronic corpus. The Electronic corpus includes a free collection of 106 literary texts written in Macedonian, intended for research purposes.
  • Knigoteka gives access to more than 1,000 online books. They can be used for research purposes with written permission from the Foundation Makedonika.
  • The Small corpus of Macedonian folklore consists of 185 short stories, 46 legends, and 1,293 jokes. It is free to use.
  • A large corpus of the Macedonian language is a complementary part of the Digital Macedonian dictionary and is associated with the selected lexemes. As with the dictionary, some of the basic data can be used freely for educational and scientific purposes after obtaining written approval from SAM97 GmbH.
  • Samoglas is an online platform with 30 audio books. It is free upon registration.
  • DLib corpus consists of 30 audio books, which can be used for speech research.
  • Maika is a very successful text-to-speech system, created using Deep Learning.
  • The news aggregators time.mk, grid.mk, daily.mk, and vesti.mk have large corpora; however, none of them can be analysed online.

Q1.4: What linguistic annotation schemas are used in Macedonian corpora?

While none of the above corpora are linguistically annotated, the MULTEXT-East Morphosyntactic Specifications also cover the Macedonian language.

Q1.5: Where can I download Macedonian resources?

  • The Macedonian Academy of Sciences and Arts offers, in the scope of its Electronic corpus, 135 Volumes of Macedonian Literature. The collection is available for download under the CC Attribution-NonCommercial 4.0 License.
  • The CLARIN.SI repository of language resources and tools hosts Macedonian language resources developed in the scope of the MULTEXT-East project, in particular the large mkLex inflectional lexicon, consisting of nearly 90,000 lexemes and 1,300,000 word-forms, and the novel “1984”, annotated in a number of languages, including the Macedonian translation. Note, however, that while the Macedonian “1984” is annotated with morphosyntactic descriptions (PoS tags) and lemmas, these have not been disambiguated, i.e. each word in the novel is annotated with all of its theoretically possible tags and lemmas.
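
To make the notion of non-disambiguated annotation concrete, the following sketch models a token that carries every theoretically possible reading and picks out the tokens that would still need disambiguation. The class names, the example sentence fragment and the MSD values are illustrative assumptions; they do not reproduce the actual encoding of the MULTEXT-East “1984” corpus.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Analysis:
    lemma: str
    msd: str  # morphosyntactic description (illustrative values below)


@dataclass(frozen=True)
class Token:
    form: str
    analyses: Tuple[Analysis, ...]  # every theoretically possible reading


def is_ambiguous(token):
    """A token is ambiguous if it has more than one distinct reading."""
    return len(set(token.analyses)) > 1


# "жени" can be read as a plural noun (lemma "жена") or as a
# verb form (lemma "жени"); "книги" has a single reading here.
tokens = [
    Token("книги", (Analysis("книга", "Ncfpnn"),)),
    Token("жени", (Analysis("жена", "Ncfpnn"), Analysis("жени", "Vmip3s"))),
]

ambiguous_forms = [t.form for t in tokens if is_ambiguous(t)]
share = len(ambiguous_forms) / len(tokens)
print(ambiguous_forms, share)
```

A disambiguated corpus would keep exactly one `Analysis` per token, chosen according to context, which is what a tagger needs as training data.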

2. Tools to annotate Macedonian texts

Q2.1: How can I perform basic linguistic processing of my Macedonian texts?

  • Currently, the only tool available for processing the Macedonian language is the ReLDI-tokeniser. We are working on releasing a training dataset that would allow Macedonian to be added to the CLASSLA pipeline at the levels of part-of-speech tagging and lemmatization.

Q2.2: How can I standardize my texts prior to further processing?

There are currently no normalizers available for the Macedonian language.

Q2.3: How can I annotate my texts for named entities?

There are currently no named entity recognizers available for the Macedonian language. Given the rather low annotation costs of enabling named entity recognition, we consider this task to be the next priority of CLASSLA once a tagger and a lemmatiser are made available for Macedonian.

Q2.4: How can I syntactically parse my texts?

There are currently no syntactic parsers available for the Macedonian language. Once the previously mentioned processing levels are covered, the likely next step is to annotate a training corpus with Universal Dependencies.


3. Datasets to train Macedonian annotation tools

Q3.1: Where can I get word embeddings for Macedonian?

There are no embedding collections available for the Macedonian language.

Q3.2: What data is available for training a text normaliser for Macedonian?

There is no training data available for training a text normaliser for Macedonian.

Q3.3: What data is available for training a part-of-speech tagger for Macedonian?

There is currently no training data available for training part-of-speech taggers for Macedonian, but efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated and disambiguated corpus for part-of-speech tags and lemmas. The current version contains only non-disambiguated part-of-speech tags and lemmas.

Q3.4: What data is available for training a lemmatiser for Macedonian?

  • The mkLex inflectional lexicon can be used to train a lemmatiser for Macedonian.
  • There is currently no training corpus available for training a lemmatiser for Macedonian, but efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated corpus for part-of-speech tags and lemmas.
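
One simple way a lexicon of (word-form, lemma) pairs could be turned into a lemmatiser is sketched below: known forms are looked up directly, and unknown forms fall back to suffix-replacement rules learned from the pairs. This is a naive baseline under stated assumptions, not the method used by CLASSLA; it ignores part-of-speech information, and the training pairs are a handful of illustrative examples rather than mkLex data.

```python
from collections import Counter, defaultdict


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def train(pairs):
    """Build an exact lookup table and suffix-replacement rule counts."""
    lookup = {}
    rules = defaultdict(Counter)  # form suffix -> Counter of lemma suffixes
    for form, lemma in pairs:
        lookup.setdefault(form, lemma)
        k = common_prefix_len(form, lemma)
        rules[form[k:]][lemma[k:]] += 1
    return lookup, rules


def lemmatise(form, lookup, rules):
    """Exact lookup first, then the longest matching suffix rule."""
    if form in lookup:
        return lookup[form]
    for start in range(len(form)):  # longest suffix first
        suffix = form[start:]
        if suffix in rules:
            replacement = rules[suffix].most_common(1)[0][0]
            return form[:start] + replacement
    return form  # no rule applies: return the form unchanged


# Illustrative (word-form, lemma) pairs for feminine nouns.
PAIRS = [
    ("книги", "книга"), ("книгата", "книга"),
    ("жени", "жена"), ("жената", "жена"),
    ("маси", "маса"), ("масата", "маса"),
]

lookup, rules = train(PAIRS)
print(lemmatise("куќи", lookup, rules))    # unseen form, rule "и" -> "а"
print(lemmatise("куќата", lookup, rules))  # unseen form, rule "та" -> ""
```

Because the rules are applied without any grammatical context, such a baseline will overgeneralise (for example, a final “и” is not always a feminine plural ending), which is why a disambiguated training corpus remains necessary for a reliable lemmatiser.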

Q3.5: What data is available for training a named entity recogniser for Macedonian?

  • There are currently no corpora of Macedonian that would be manually annotated for named entities and could serve as a training set.
  • The Dictionary of personal names and surnames provided by the State Statistical Office gives the frequency of all names and surnames of citizens in Macedonia that occur at least 5 times. This dictionary could be used as a gazetteer for a named entity recognizer.
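
The gazetteer idea can be sketched as a simple lookup: names and surnames are loaded into sets, and each token of a text is marked as a candidate name if it appears in one of them. The (name, frequency) row format, the tag labels and the frequency values below are invented for illustration and do not reflect the actual format or content of the State Statistical Office dictionary.

```python
def build_gazetteer(rows, min_count=5):
    """Keep the names whose frequency reaches the threshold."""
    return {name for name, count in rows if count >= min_count}


def tag_candidates(tokens, first_names, surnames):
    """Mark tokens found in the gazetteers; everything else gets 'O'."""
    tagged = []
    for token in tokens:
        if token in first_names:
            tagged.append((token, "NAME"))
        elif token in surnames:
            tagged.append((token, "SURNAME"))
        else:
            tagged.append((token, "O"))
    return tagged


# Hypothetical rows; the frequencies are made up for this example.
FIRST_NAME_ROWS = [("Марија", 1200), ("Александар", 950)]
SURNAME_ROWS = [("Петрова", 300), ("Стојановски", 410)]

first_names = build_gazetteer(FIRST_NAME_ROWS)
surnames = build_gazetteer(SURNAME_ROWS)

sentence = ["Марија", "Петрова", "живее", "во", "Скопје"]
tagged = tag_candidates(sentence, first_names, surnames)
print(tagged)
```

A gazetteer match only yields candidates, since some names coincide with common words; in a full named entity recognizer it would typically serve as one feature among others rather than as the final decision.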

Q3.6: What data is available for training a syntactic parser for Macedonian?

There is currently no training data available for training a syntactic parser for Macedonian.