This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or incorrect information, please let us know at firstname.lastname@example.org, with the subject “FAQ_Macedonian”.
The questions in this FAQ are organised into three main sections: online Macedonian language resources, tools to annotate Macedonian texts, and datasets to train Macedonian annotation tools.
1. Online Macedonian language resources
Q1.1: Where can I find Macedonian dictionaries?
Below we list the main lexical resources:
- The Digital Macedonian dictionary (Дигитален речник на македонскиот јазик), available on the Macedonian Language Portal, is a multilingual dictionary of the Macedonian language consisting of approximately 76,000 headwords and more than 120,000 word forms that are included in the embedded corpus. It supports searching for strings at the beginning, at the end, or anywhere within an entry, in both the Cyrillic and Latin alphabets. Dictionary entries include the word forms of the headwords, English and Albanian translations, synonyms, and examples from the embedded corpus that illustrate how the words are used. The dictionary is protected by copyright. Some of the basic data can be used freely for educational and scientific purposes after obtaining written approval from SAM97 GmbH. The dictionary has a mobile version and a mobile application, which can be downloaded from Google Play.
- The Reverse dictionary of the Macedonian language consists of approximately 150,000 entries, alphabetized by suffixes.
- The Acronym and abbreviation dictionary contains more than 2,000 acronyms, mainly administrative, and a few abbreviations.
- The Dialectological dictionary of the Research Center for Areal Linguistics is a digital database of Macedonian dialects, whose purpose is to provide about 50,000 dialect lexemes and word forms.
- A digitised Dictionary of Macedonian surnames is offered by the Institute of Macedonian Language “Krste Misirkov”.
- The verb conjugators Fleximac and Vigna are useful for verb annotation, analysis and synthesis. In addition to 8 tenses and 4 verbal constructions, Vigna's output also includes the corresponding verbal noun, verbal adjectives, and the gerund.
- mkLex is an inflectional lexicon of the Macedonian language, consisting of nearly 90,000 lexemes and 1,300,000 word forms; it can be downloaded from the CLARIN.SI repository (file wfl-mk.txt.gz).
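As a rough illustration of working with a downloaded lexicon file such as wfl-mk.txt.gz, the sketch below reads a gzipped, tab-separated file and groups word forms under their lemma. The column layout (word form, lemma, morphosyntactic tag) is an assumption that should be checked against the file itself, and `read_lexicon` and `forms_by_lemma` are hypothetical helper names.

```python
import gzip
from collections import defaultdict


def read_lexicon(path):
    """Yield (word_form, lemma, tag) triples from a gzipped, tab-separated file.

    Assumed layout (not confirmed by this FAQ): form <TAB> lemma <TAB> tag.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            yield fields[0], fields[1], fields[2]


def forms_by_lemma(entries):
    """Group the word forms of each lemma into a set."""
    grouped = defaultdict(set)
    for form, lemma, _tag in entries:
        grouped[lemma].add(form)
    return grouped
```

With the file in the working directory, `forms_by_lemma(read_lexicon("wfl-mk.txt.gz"))` would return a dictionary from lemmas to their attested forms, provided the assumed column order holds.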
Q1.2: How can I analyse Macedonian corpora online?
CLARIN.SI offers access to two concordancers, which share the same set of corpora and back-end, but have different front-ends:
- NoSketch Engine, an open-source variant of the well-known Sketch Engine. No registration is necessary or possible, which has some drawbacks, e.g., not being able to save your screen settings or make private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.
Documentation on how to query corpora via the SketchEngine-like interfaces is available here.
Note that the commercial Sketch Engine also offers access to some Macedonian corpora, as well as some additional tools that are not accessible in the free NoSketch Engine. These tools allow users to compute frequency lists of multiword expressions (N-grams) and to create their own corpora. For researchers in the EU, access to Sketch Engine is free for non-commercial purposes in the period 2018-2022.
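For readers unfamiliar with N-gram frequency lists, the following minimal sketch counts runs of n consecutive tokens in an already tokenised text. It only illustrates the general idea and makes no claim about how Sketch Engine computes its lists; `ngram_counts` is a hypothetical helper name.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input: whitespace-separated tokens, punctuation already split off.
tokens = "на пример , на пример".split()
bigrams = ngram_counts(tokens, 2)

# List the bigrams from most to least frequent.
for ngram, count in bigrams.most_common():
    print(" ".join(ngram), count)
```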
Q1.3: Which Macedonian corpora can I analyse online?
In addition to the corpora available through the CLARIN.SI concordancers, the Macedonian Language Portal includes a corpus of the Macedonian language, which can be queried via its specialized interface. As with the associated dictionary, some of the basic data can be used freely for educational and scientific purposes after obtaining written approval from SAM97 GmbH.
Furthermore, the Sketch Engine, which is free for non-commercial purposes in the period 2018-2022, currently provides the following Macedonian corpora: OPUS2 Macedonian, which is part of a parallel corpus covering 40 languages.
Q1.4: What linguistic annotation schemas are used in Macedonian corpora?
While most of the above corpora are not linguistically annotated, the most recent ones (such as the Wikipedia corpus CLASSLAWiki-mk) are annotated according to the MULTEXT-East Morphosyntactic Specifications for Macedonian. For syntactic analysis with dependency relations, the annotation scheme of the Universal Dependencies project has been used.
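MULTEXT-East morphosyntactic descriptions (MSDs) are positional tags: the first character gives the category (part of speech), and each following character encodes one attribute, with a hyphen marking an attribute that does not apply. The sketch below splits such a tag into its category and attribute codes. The category letters are those commonly used across MULTEXT-East languages; the meaning of each attribute position is language-specific and must be taken from the Macedonian specification, so it is deliberately not decoded here. The name `split_msd` and the example tag are illustrative assumptions.

```python
# Category letters commonly used in MULTEXT-East MSDs (assumed to apply here;
# the Macedonian specification is the authoritative source).
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "R": "Adverb", "S": "Adposition", "C": "Conjunction", "M": "Numeral",
    "Q": "Particle", "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}


def split_msd(msd):
    """Return (category_name, attribute_codes) for a positional MSD string."""
    if not msd:
        raise ValueError("empty MSD")
    category = CATEGORIES.get(msd[0], "Unknown")
    # A hyphen means the attribute at that position does not apply.
    attributes = [None if code == "-" else code for code in msd[1:]]
    return category, attributes


category, attributes = split_msd("Ncmsnn")  # illustrative tag string
print(category, attributes)
```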
Q1.5: Where can I download Macedonian resources?
The main point for archiving and downloading Macedonian language resources is the repository of CLARIN.SI.
In addition to the resources mentioned above and below, the repository offers:
- the Annotated Corpus of Pre-Standardized Balkan Slavic Literature, which is manually lemmatised, morphosyntactically tagged and syntactically analysed with dependency relations,
- the concreteness and imageability lexicon MEGA.HR-Crossling that contains concreteness and imageability predictions of words in 77 languages, Macedonian included,
- and the annotated parallel corpus MULTEXT-East “1984”. The Macedonian part is annotated with morphosyntactic descriptions (PoS tags) and lemmas. These annotations, however, have not been disambiguated, i.e., the words in the novel are annotated with all the theoretically possible tags and lemmas.
Another point where you can find Macedonian resources is the MetaShare repository, which includes the SouthEast European Parallel Corpus SETimes, and a set of parallel and monolingual corpora Parallel Global Voices.
In addition to this, there are several other Macedonian resources, which can be downloaded or requested:
- The Macedonian Academy of Sciences and Arts has a large collection of Digital Resources, which includes old books and dictionaries, materials on the historical development of the Macedonian language, digital resources for Macedonian dialects, grammars, studies, comparative Slavic and Balkan studies, a Macedonian language bibliography, and an Electronic corpus. The latter is a free collection of literary texts that is intended for research purposes and can be downloaded.
- Knigoteka gives access to more than 1,000 online books. They can be used for research purposes with written permission from the Foundation Makedonika.
- The Small corpus of Macedonian folklore consists of 185 short stories, 46 legends, and 1,293 jokes.
- Samoglas is an online platform with 30 audio books. It is free upon registration.
- The DLib corpus consists of 30 audio books which can be used for speech research.
- News aggregators: time.mk, grid.mk, daily.mk, and vesti.mk contain a large amount of text that could be collected into corpora.
2. Tools to annotate Macedonian texts
Q2.1: How can I perform basic linguistic processing of my Macedonian texts?
The state-of-the-art CLASSLA pipeline provides processing of standard Macedonian on the levels of tokenisation and sentence splitting, part-of-speech tagging, and lemmatisation. For Macedonian, the CLASSLA pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models for lemmatisation and part-of-speech tagging of standard Macedonian are also available.
The documentation for the installation and use of the pipeline is available here.
In addition to this, the Macedonian spaCy, presented here, also provides tokenisation, lemmatisation, and part-of-speech tagging for Macedonian. The documentation for the installation is available here.
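Annotated output from such pipelines is commonly exchanged in CoNLL-U, the ten-column, tab-separated format used by the Universal Dependencies project. Assuming the output has been exported in that format, the sketch below reads it back into Python using only the standard library. The sample sentence is hand-written for illustration and is not real pipeline output; `parse_conllu` is a hypothetical helper name.

```python
def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append({"id": int(cols[0]), "form": cols[1],
                        "lemma": cols[2], "upos": cols[3], "xpos": cols[4]})
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample; columns other than form, lemma and UPOS are left as "_".
SAMPLE = "\n".join([
    "# text = Скопје е град.",
    "\t".join(["1", "Скопје", "Скопје", "PROPN", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "е", "сум", "AUX", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["3", "град", "град", "NOUN", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "_", "_", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"])
```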
Q2.2: How can I standardize my texts prior to further processing?
There are currently no normalisers available for the Macedonian language.
Q2.3: How can I annotate my texts for named entities?
Q2.4: How can I syntactically parse my texts?
There are currently no syntactic parsers available for the Macedonian language. Once the previously mentioned processing levels are taken care of, the next step would probably be to annotate a contemporary training corpus with Universal Dependencies.
3. Datasets to train Macedonian annotation tools
Q3.1: Where can I get word embeddings for Macedonian?
The embeddings trained on the largest collection of Macedonian textual data (a 320-million-token web crawl of the .mk domain) are available as the CLARIN.SI-embed.mk embedding collection. Collections of trained embeddings for Macedonian are also available from fastText.
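Word embeddings are often distributed in the plain-text word2vec/fastText layout: a header line with the vocabulary size and the dimension, followed by one line per word with its vector components. Assuming a downloaded file uses that layout (this should be verified for each collection), the sketch below loads vectors and compares two words by cosine similarity. The toy vectors are invented for the demonstration, and `load_vectors` and `cosine` are hypothetical helper names.

```python
import math


def load_vectors(lines, limit=None):
    """Parse 'count dim' header plus 'word v1 ... vd' rows into a dict."""
    rows = iter(lines)
    _count, dim = (int(value) for value in next(rows).split())
    vectors = {}
    for row in rows:
        parts = row.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy two-dimensional vectors, invented for illustration only.
toy = ["3 2", "град 1.0 0.0", "село 0.8 0.6", "јаде 0.0 1.0"]
vectors = load_vectors(toy)
similarity = cosine(vectors["град"], vectors["село"])
```

For a real file, an open file handle can be passed instead of the toy list, for example `load_vectors(open(path, encoding="utf-8"), limit=50000)`, where `path` points at the downloaded vectors.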
Q3.2: What data is available for training a text normaliser for Macedonian?
There is no training data available for training a text normaliser for Macedonian.
Q3.3: What data is available for training a part-of-speech tagger for Macedonian?
There is currently no contemporary training data available for training part-of-speech taggers for Macedonian, but efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated and disambiguated corpus for part-of-speech tags and lemmas. The current version has undisambiguated part-of-speech tags and lemmas.
Q3.4: What data is available for training a lemmatiser for Macedonian?
- The mkLex inflectional lexicon can be used to train a lemmatiser for Macedonian.
- There is currently no contemporary training corpus available for training a lemmatiser for Macedonian, but efforts are being made to release the MULTEXT-East 1984 corpus as a manually annotated corpus for part-of-speech tags and lemmas.
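Until a training corpus is available, an inflectional lexicon can at least support a lookup baseline. The sketch below is such a baseline, not a trained lemmatiser: it maps each word form to the lemma it is most often paired with and returns unknown tokens unchanged. It assumes the lexicon has already been reduced to (word form, lemma) pairs; the example pairs are illustrative and `build_lemmatiser` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_lemmatiser(entries):
    """Build a lookup function from (word_form, lemma) pairs."""
    candidates = defaultdict(Counter)
    for form, lemma in entries:
        candidates[form.lower()][lemma] += 1
    # For ambiguous forms keep the lemma seen most often (ties: first seen).
    table = {form: counts.most_common(1)[0][0]
             for form, counts in candidates.items()}

    def lemmatise(token):
        # Unknown tokens are returned unchanged.
        return table.get(token.lower(), token)

    return lemmatise


# Illustrative pairs only; real entries would come from the lexicon file.
lemmatise = build_lemmatiser([
    ("книги", "книга"),
    ("книгата", "книга"),
    ("градови", "град"),
])
print(lemmatise("Книги"), lemmatise("градови"))
```

A lookup of this kind cannot resolve forms that belong to several lemmas in context, which is why a disambiguated corpus is still needed for a proper model.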
Q3.5: What data is available for training a named entity recogniser for Macedonian?
There are currently no corpora of Macedonian that are manually annotated for named entities and that could serve as a training set.
The State Statistical Office provides the frequencies of all the given names and surnames of Macedonian citizens which occur at least 5 times. This list could be used as a reference resource for a named entity recogniser. The digitised Dictionary of Macedonian surnames from the Institute of Macedonian Language “Krste Misirkov” could be useful as well.
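One simple way to use such a name list is as a gazetteer, that is, a set of known names against which tokens are matched. The sketch below shows the idea under the assumption that the list has been converted to (name, frequency) pairs. The counts are made up for the example, the labels "NAME" and "O" are arbitrary, and `load_gazetteer` and `tag_names` are hypothetical helper names; exact string matching of this kind is only a starting point for a recogniser.

```python
def load_gazetteer(rows, min_count=5):
    """Keep the names whose frequency is at least min_count."""
    return {name for name, count in rows if count >= min_count}


def tag_names(tokens, gazetteer):
    """Label each token 'NAME' if it is in the gazetteer, otherwise 'O'."""
    return [(token, "NAME" if token in gazetteer else "O") for token in tokens]


# Made-up frequencies, for illustration only.
rows = [("Марија", 120), ("Петар", 80), ("Ретко", 2)]
gazetteer = load_gazetteer(rows)

tagged = tag_names("Марија живее во Скопје".split(), gazetteer)
print(tagged)
```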
Q3.6: What data is available for training a syntactic parser for Macedonian?
There is currently no contemporary training data available for training a syntactic parser for Macedonian.