Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

FAQ for Macedonian language resources and technologies

This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Macedonian”.

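The questions in this FAQ are organised into three main sections: online Macedonian language resources, tools to annotate Macedonian texts, and datasets to train Macedonian annotation tools.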

1. Online Macedonian language resources

Q1.1: Where can I find Macedonian dictionaries?

Below we list the main lexical resources:

  • The Digital Macedonian dictionary (Дигитален речник на македонскиот јазик), available on the Macedonian Language Portal, is a multilingual dictionary of the Macedonian language consisting of approximately 76,000 headwords and more than 120,000 word forms that are included in the embedded corpus. It supports searching for strings at the beginning, at the end or within an entry, in either the Cyrillic or the Latin alphabet. Dictionary entries include the word forms of the headwords, English and Albanian translations, examples from the embedded corpus that illustrate the usage of the words, and synonyms. The dictionary is protected by copyright. Some of the basic data can be freely used for educational and scientific purposes after obtaining written approval from SAM97 GmbH. The dictionary has a mobile version and a mobile application, which can be downloaded from Google Play.
  • The Reverse dictionary of the Macedonian language consists of approximately 150,000 entries, alphabetized by suffixes.
  • Acronym and abbreviation dictionary contains more than 2,000 acronyms, mainly administrative, and a few abbreviations.
  • The Dialectological dictionary of the Research Center for Areal Linguistics is a digital database of Macedonian dialects, whose purpose is to provide about 50,000 dialect lexemes and word forms.
  • A digitised Dictionary of Macedonian surnames is offered by the Institute of Macedonian Language “Krste Misirkov”.
  • The verb conjugators Fleximac and Vigna are useful for verb annotation, analysis and synthesis. In addition to 8 tenses and 4 verbal constructions, Vigna's output also includes the corresponding verbal noun, verbal adjectives, and the gerund.
  • mkLex is an inflectional lexicon of the Macedonian language, consisting of nearly 90,000 lexemes and 1,300,000 word forms; it can be downloaded from the CLARIN.SI repository (file wfl-mk.txt.gz).
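
A downloaded lexicon file such as wfl-mk.txt.gz could be read along the following lines. This is only a minimal sketch: it assumes, without confirmation from the repository documentation, that each line holds a word form, its lemma and a morphosyntactic tag separated by tabs. The function name load_lexicon and the column order are illustrative assumptions, so the file should be inspected before relying on them.

```python
import gzip
from collections import defaultdict


def load_lexicon(path):
    """Map each word form to the set of (lemma, tag) pairs listed for it.

    ASSUMPTION: lines look like wordform<TAB>lemma<TAB>tag.
    The real column layout of the file may differ.
    """
    entries = defaultdict(set)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            form, lemma, tag = parts[:3]
            entries[form].add((lemma, tag))
    return entries
```

Keeping a set of analyses per word form preserves ambiguity, which matters for an inflectional lexicon where one form can belong to several lemmas or carry several tags.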

Q1.2: How can I analyse Macedonian corpora online?

CLARIN.SI offers access to four concordancers, which share the same set of corpora and back-end, but have different front-ends:

  • CLARIN.SI Crystal noSketch Engine, an open-source variant of the well-known Sketch Engine. Instructions for its use are available here. CLARIN.SI offers two installations of Crystal noSketch Engine: an open installation (no log-in, which simplifies use for less advanced users) and a version with log-in, which allows subcorpus creation and personalised display of e.g. corpus attributes.
  • KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.
  • CLARIN.SI Bonito noSketch Engine is the old version of noSketch Engine, with a radically different user interface from Crystal. This version offers some functions that the new noSketch Engine does not, in particular access to query results in XML, for which it is enough to add the parameter “format=XML” to the end of the query URL.
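
Adding the “format=XML” parameter to a Bonito query URL can be sketched in Python as follows. The helper name with_xml_format and the example URL are hypothetical and serve only as an illustration; a real query URL should be copied from the browser after running a query in the Bonito interface.

```python
def with_xml_format(query_url):
    """Append the format=XML parameter to the end of a query URL."""
    # Use "&" when the URL already carries parameters, "?" otherwise.
    separator = "&" if "?" in query_url else "?"
    return f"{query_url}{separator}format=XML"


# Hypothetical URL, for illustration only.
example = "https://example.org/bonito/run.cgi/first?corpname=demo&iquery=test"
print(with_xml_format(example))
```

The parameter is appended as a plain string rather than by re-encoding the whole query, so the existing part of the URL is left exactly as the interface produced it.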

Documentation on how to query corpora via the SketchEngine-like interfaces is available here.

Note that the commercial Sketch Engine also offers access to some Macedonian corpora, as well as some additional tools that are not available in the free noSketch Engine; these allow users to compute frequency lists of multiword expressions (N-grams) and to create their own corpora.

Q1.3: Which Macedonian corpora can I analyse online?

For a complete list of corpora available under CLARIN.SI concordancers, see the index for Crystal noSkE, Bonito noSkE or KonText. They currently include two Macedonian corpora:

  • the CLASSLA-web.mk web corpus (560 million tokens), the first linguistically annotated general corpus of Macedonian
  • the Wikipedia corpus CLASSLAWiki-mk

In addition to this, the Macedonian Language Portal includes a corpus of the Macedonian language, which can be queried via its specialized interface. As with the associated dictionary, some of the basic data can be freely used for educational and scientific purposes after obtaining written approval from SAM97 GmbH.

Furthermore, the commercial Sketch Engine currently provides one Macedonian corpus: OPUS2 Macedonian, which is part of a parallel corpus covering 40 languages.

Q1.4: What linguistic annotation schemas are used in Macedonian corpora?

The first manually annotated linguistic training corpus for Macedonian is SETimes.MK. The morphosyntactic labels were assigned following the MULTEXT-East standard for Macedonian. Please note that the dataset does not completely follow the Universal Dependencies specifications for Macedonian: its UPOS and FEATS values are derived from the MULTEXT-East specifications, which differ in certain respects from the Universal Dependencies ones.

The Wikipedia corpus CLASSLAWiki-mk is annotated according to the MULTEXT-East Morphosyntactic Specifications for Macedonian. For syntactic analysis with dependency relations, the annotation scheme of the Universal Dependencies project has been used.
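
To illustrate how UPOS and FEATS annotations of this kind are typically accessed, the sketch below reads text in the CoNLL-U format used by the Universal Dependencies project. It assumes that the corpus at hand is distributed as CoNLL-U, which should be checked against the corpus documentation. The sample sentence and its annotations are hand-written for illustration and are not taken from any of the corpora above; the function names are likewise illustrative.

```python
# Hand-written illustration, NOT taken from SETimes.MK or CLASSLAWiki-mk.
SAMPLE = (
    "# text = Ова е пример.\n"
    "1\tОва\tова\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tе\tсум\tAUX\t_\tNumber=Sing|Person=3\t_\t_\t_\t_\n"
    "3\tпример\tпример\tNOUN\t_\tGender=Masc|Number=Sing\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
)


def parse_feats(feats):
    """Turn a FEATS string such as 'Gender=Masc|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(text):
    """Yield sentences as lists of token dicts from CoNLL-U formatted text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        sentence.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "xpos": cols[4],
            "feats": parse_feats(cols[5]),
        })
    if sentence:
        yield sentence


for token in next(read_conllu(SAMPLE)):
    print(token["form"], token["upos"], token["feats"])
```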

Q1.5: Where can I download Macedonian resources?

The main point for archiving and downloading Macedonian language resources is the repository of CLARIN.SI.

In addition to the resources mentioned above and below, the repository offers:

  • the Macedonian monolingual web corpus MaCoCu-mk with 524 million words and the Macedonian-English parallel corpus MaCoCu-mk-en,
  • the manually annotated linguistic training corpus for Macedonian SETimes.MK,
  • the corpus of Macedonian language-related news articles MetaLangNEWS-Mk,
  • the Annotated Corpus of Pre-Standardized Balkan Slavic Literature, which is manually lemmatised, morphosyntactically tagged and syntactically analysed with dependency relations,
  • the concreteness and imageability lexicon MEGA.HR-Crossling that contains concreteness and imageability predictions of words in 77 languages, Macedonian included,
  • the choice of plausible alternatives dataset in Macedonian COPA-MK,
  • and the annotated parallel corpus MULTEXT-East “1984”. The Macedonian part is annotated with morphosyntactic descriptions (PoS tags) and lemmas, but these have not been disambiguated, i.e., the words in the novel are annotated with all theoretically possible tags and lemmas.

Another point where you can find Macedonian resources is the MetaShare repository, which includes the SouthEast European Parallel Corpus SETimes, and a set of parallel and monolingual corpora Parallel Global Voices.

In addition to this, there are several other Macedonian resources, which can be downloaded or requested:

  • The Macedonian Academy of Sciences and Arts has a large collection of Digital Resources, which include old books and dictionaries, the historical development of the Macedonian language, digital resources for Macedonian dialects, grammars, studies, comparative Slavic and Balkan studies, a Macedonian language bibliography, and an Electronic corpus, a free collection of literary texts that is intended for research purposes and can be downloaded.
  • Knigoteka gives access to more than 1,000 online books. They can be used for research purposes with written permission from the Foundation Makedonika.
  • The Small corpus of Macedonian folklore consists of 185 short stories, 46 legends, and 1,293 jokes.
  • Samoglas is an online platform with 30 audio books. It is free upon registration.
  • The DLib corpus consists of 30 audio books which can be used for speech research.
  • The news aggregators time.mk, grid.mk, daily.mk, and vesti.mk contain large amounts of text that could be compiled into corpora.

2. Tools to annotate Macedonian texts

Q2.1: How can I perform basic linguistic processing of my Macedonian texts?

The state-of-the-art CLASSLA pipeline provides processing of standard Macedonian on the levels of tokenisation and sentence splitting, part-of-speech tagging, and lemmatisation. For Macedonian, the CLASSLA pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models for lemmatisation and part-of-speech tagging of standard Macedonian are also available. You can try out the pipeline at the CLASSLA Annotation tool website.

The documentation for the installation and use of the pipeline is available here.
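
As a rough sketch of what using the pipeline from Python can look like, the function below wraps the calls of the third-party classla package (installed separately, e.g. with pip install classla). The wrapper name annotate_macedonian is illustrative, the models are downloaded over the network on first use, and the pipeline documentation remains the authoritative reference for the available options.

```python
def annotate_macedonian(text):
    """Tokenise, tag and lemmatise Macedonian text; return CoNLL-U output."""
    # Imported here so that defining the function does not require
    # the third-party package to be installed.
    import classla

    classla.download("mk")        # fetch the standard Macedonian models
    nlp = classla.Pipeline("mk")  # build the pipeline with default processors
    doc = nlp(text)
    return doc.to_conll()
```

Calling annotate_macedonian on a short Macedonian sentence should return the annotated sentence as a CoNLL-U string, one token per line. In practice the pipeline object would be created once and reused for many texts, since loading the models is the slow step.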

In addition to this, the Macedonian spaCy pipeline, presented here, also provides tokenisation, lemmatisation, and part-of-speech tagging for Macedonian. The documentation for the installation is available here.

Q2.2: How can I standardize my texts prior to further processing?

There are currently no normalisers available for the Macedonian language.

Q2.3: How can I annotate my texts for named entities?

Named entity recognition for Macedonian is provided by the Macedonian spaCy pipeline, which can be downloaded here.

Q2.4: How can I syntactically parse my texts?

There are currently no syntactic parsers available for the Macedonian language. Now that the processing levels mentioned above are covered, the most likely next step is to annotate a contemporary training corpus with Universal Dependencies.


3. Datasets to train Macedonian annotation tools

Q3.1: Where can I get word embeddings for Macedonian?

The embeddings trained on the largest collection of Macedonian textual data (a 930-million-token web crawl of the .mk domain) are those in the CLARIN.SI-embed.mk embedding collection. Collections of trained embeddings for Macedonian are also available from fastText.
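
Once an embedding file has been downloaded, it can be inspected with a few lines of Python. The sketch below assumes the plain-text .vec layout used by fastText (a header line with vocabulary size and dimension, then one word and its vector per line); whether a given collection is distributed in this layout is an assumption to verify. The toy vectors are invented for illustration and are not real embeddings.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read word vectors from a fastText-style .vec text stream."""
    header = handle.readline().split()
    dim = int(header[1])  # header: "<vocabulary size> <dimension>"
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data, invented for illustration only.
TOY = "3 2\nа 1.0 0.0\nб 0.0 1.0\nв 2.0 0.0\n"
vectors = load_vec(io.StringIO(TOY))
print(cosine(vectors["а"], vectors["в"]))
```

For a real file, the stream would come from open(path, encoding="utf-8"), and the limit argument keeps memory use manageable by loading only the first lines of the file.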

Q3.2: What data is available for training a text normaliser for Macedonian?

There is currently no data available for training a text normaliser for Macedonian.

Q3.3: What data is available for training a part-of-speech tagger for Macedonian?

The reference dataset for training a standard tagger is the manually annotated linguistic training corpus SETimes.MK. Additionally, efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated and disambiguated corpus for part-of-speech tags and lemmas. The current version has undisambiguated part-of-speech tags and lemmas.

Q3.4: What data is available for training a lemmatiser for Macedonian?

Lemmatisers can be trained on the tagger training data (SETimes.MK; see Q3.3 for details), on the inflectional lexicon mkLex, or on both. Additionally, efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated corpus for part-of-speech tags and lemmas.

Q3.5: What data is available for training a named entity recogniser for Macedonian?

There are currently no corpora of Macedonian that are manually annotated for named entities and that could serve as a training set.

The State Statistical Office provides the frequencies of all first names and surnames of Macedonian citizens that occur at least 5 times. This list could be used as a reference resource for a named entity recogniser. The digitised Dictionary of Macedonian surnames of the Institute of Macedonian Language “Krste Misirkov” could be useful as well.

Q3.6: What data is available for training a syntactic parser for Macedonian?

There is currently no contemporary training data available for training a syntactic parser for Macedonian. While the Macedonian-MTB treebank from the Universal Dependencies datasets is too small for training, it can be used for evaluation of syntactic parsers.
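
Evaluation against a treebank is commonly reported as unlabelled and labelled attachment scores (UAS and LAS). These metrics are not prescribed by this FAQ; they are named here as the usual choice. The sketch below computes both for token-aligned gold and predicted analyses, with a toy example invented for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for aligned lists of (head, deprel) pairs.

    UAS counts tokens whose head is correct; LAS additionally
    requires the dependency relation label to match.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0, 0.0
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), label_hits / len(gold)


# Toy three-token sentence, invented for illustration only.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

The sketch assumes that the parser output uses the same tokenisation as the treebank; when tokenisation differs, the tokens first have to be aligned before the scores are meaningful.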