This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at firstname.lastname@example.org, with the subject “FAQ Slovene”.
The questions in this FAQ are organised into three main sections:
- Online Slovene language resources
- Tools to annotate Slovene texts
- Datasets to train Slovene annotation tools
  - Where can I get word embeddings for Slovene?
  - What data is available for training a text normaliser for Slovene?
  - What data is available for training a part-of-speech tagger for Slovene?
  - What data is available for training a lemmatiser for Slovene?
  - What data is available for training a named entity recogniser for Slovene?
  - What data is available for training a syntactic parser for Slovene?
Online Slovene language resources
Q1.1: Where can I find Slovene dictionaries?
Below we list the main dictionary portals offered by CLARIN.SI partners or supported by CLARIN.SI:
- FRAN offers aggregate search over all the Slovene dictionaries (general, etymological, historical, terminological, and dialectal) of the Institute for Slovene Language at ZRC
- The Thesaurus, the Collocation dictionary and a small glossary of Twitterese are available at the CJVT infrastructure centre of the University of Ljubljana
- Kontext.io is a lexicon of semantically related words for Slovene, Croatian and Serbian, automatically produced on the basis of word embeddings from large corpora
- Termania is a portal of free online dictionaries of various languages and fields offered by the Amebis company
- Sloleks, a Slovene morphological lexicon, SSSJ, a test Slovene online dictionary, and LBS, a prototype Slovene lexical database, are among the results of the “Communication in Slovene” project, with the portal hosted at CLARIN.SI
- sloWNet, the WordNet-based Slovenian semantic lexicon, IMP, a glossary of archaic Slovene, and jaSlo, a Japanese-Slovene learners’ dictionary are offered by the Jožef Stefan Institute
- Razvezani jezik, a dictionary of “living Slovene” is offered by the Domestic Research Society
Dictionaries by other providers:
- Evroterm, a multilingual terminology database, and a list of online dictionaries, by the Government of the Republic of Slovenia
- Islovar, a terminological dictionary for the field of Informatics by the Slovene Society for Informatics
- Wikislovar, the Slovene Wiktionary, i.e. the multilingual, openly accessible and openly editable dictionary.
Q1.2: How can I analyse Slovene corpora online?
CLARIN.SI offers access to two concordancers, which share the same set of (Slovene) corpora and back-end, but have different front-ends:
- NoSketch Engine, an open-source variant of the well-known Sketch Engine. No registration is necessary or possible, which also has drawbacks, e.g. you cannot save your screen settings or create private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is available without logging in, but to use the more advanced functionalities it is necessary to log in via AAI through your identity provider.
Documentation on how to query corpora via the SketchEngine-like interfaces is available here.
Note that the commercial Sketch Engine offers access to several Slovene language corpora as well. Furthermore, for researchers in the EU, access to Sketch Engine is free for non-commercial purposes in 2018–2022.
Q1.3: Which Slovene corpora can I analyse online?
The main reference corpus for Slovene is Gigafida (1 billion words), which you can query via its specialized interface, via noSkE or KonText. Note that the corpus is also available in a version which has (near) duplicate paragraphs removed, cf. noSkE or KonText. A balanced subset of Gigafida is KRES (100 million tokens), which you can query via its specialized interface.
- a general language corpus (apart from Gigafida) is slWaC, a large corpus (900 million tokens) of Slovene texts from the web
- specialized corpora include the corpus of academic writing KAS, the corpus of user-generated content Janes, the spoken corpus GOS, the corpus of historical Slovene IMP and the developmental corpus ŠOLAR
- manually annotated corpora include the reference training corpus ssj500k (sampled from the Gigafida corpus), the corpus of historical Slovene goo300k (sampled from the IMP corpus), the corpus of user-generated content Janes-norm (sampled from the Janes corpus), which is manually annotated with normalised word-forms, and
- Janes-tag (sampled from Janes-norm), which is additionally manually annotated with morphosyntactic descriptions, lemmas and named entities.
Q1.4: What linguistic annotation schemas are used in Slovene corpora?
Most of these corpora are annotated on the level of morphosyntax with the MULTEXT-East tagset. On the level of syntax there are two annotation schemes, a Slovene-specific one and that of the Universal Dependencies project. The Universal Dependencies project also defines its own morphosyntactic annotation, which is currently applied only in the training corpora. Named entities are annotated following the guidelines developed for South Slavic languages.
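As a toy illustration of how the positional MULTEXT-East tags encode attributes, the sketch below decodes Slovene noun tags such as `Ncmsn`. It covers only the noun category and a subset of its attributes; the full specification defines many more categories, attributes and values, so treat this as a simplified demonstration rather than a complete decoder:

```python
# Minimal decoder for MULTEXT-East Slovene noun MSD tags (e.g. "Ncmsn").
# Each character position after the category letter encodes one attribute.

NOUN_POSITIONS = [
    ("Type",   {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case",   {"n": "nominative", "g": "genitive", "d": "dative",
                "a": "accusative", "l": "locative", "i": "instrumental"}),
]

def decode_noun_msd(msd: str) -> dict:
    """Expand a positional noun tag into attribute-value pairs."""
    if not msd or msd[0] != "N":
        raise ValueError("not a noun MSD: " + repr(msd))
    attrs = {"Category": "Noun"}
    for code, (name, values) in zip(msd[1:], NOUN_POSITIONS):
        if code != "-":          # "-" marks a non-applicable position
            attrs[name] = values[code]
    return attrs

print(decode_noun_msd("Ncmsn"))
# {'Category': 'Noun', 'Type': 'common', 'Gender': 'masculine',
#  'Number': 'singular', 'Case': 'nominative'}
```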
Q1.5: Where can I download Slovene resources?
The main point for archiving and downloading Slovene language resources is the repository of CLARIN.SI.
Tools to annotate Slovene texts
Q2.1: How can I perform basic linguistic processing of my Slovene texts?
Tokenisation, part-of-speech tagging and lemmatisation of your texts can be done via the CLARIN.SI services. The documentation for using the services, either via a web interface or as a web service, is available here. You can also install the same tools locally, namely the tokeniser and the part-of-speech tagger and lemmatiser.
Q2.2: How can I standardize my texts prior to further processing?
- Currently, the only online text normalisation tool available through the CLARIN.SI services is the REDI diacritic restorer. The usage of the CLARIN.SI services is documented here. You can also download REDI and install and use it locally.
- For word-level normalisation of e.g. historical and user-generated Slovene texts you can download and install the CSMTiser text normalizer.
Q2.3: How can I annotate my texts for named entities?
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Slovene texts in multiple ways:
- by using the CLARIN.SI services (Universal Dependencies formalism)
- by using the UDPipe tool which has off-the-shelf models for many languages, Slovene included (Universal Dependencies formalism)
- by using the Razčlenjevalnik tool (Slovene-specific formalism)
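Parses in the Universal Dependencies formalism are typically serialised as CoNLL-U, which is straightforward to post-process. A minimal sketch for extracting dependency relations from one sentence; the sentence and its analysis below are hand-made for illustration, not actual tool output:

```python
# Read head-dependent pairs from a CoNLL-U parse (tab-separated, 10 columns).
# "Jan bere knjigo." = "Jan is reading a book." (illustrative analysis).

conllu = """\
1\tJan\tJan\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tbere\tbrati\tVERB\t_\t_\t0\troot\t_\t_
3\tknjigo\tknjiga\tNOUN\t_\t_\t2\tobj\t_\t_
"""

def dependency_pairs(block: str):
    """Yield (dependent, relation, head) triples from one CoNLL-U sentence."""
    rows = [line.split("\t") for line in block.strip().splitlines()]
    forms = {row[0]: row[1] for row in rows}      # ID -> word form
    for row in rows:
        head = forms.get(row[6], "ROOT")          # HEAD "0" means the root
        yield row[1], row[7], head                # FORM, DEPREL, head form

for dep, rel, head in dependency_pairs(conllu):
    print(f"{dep} --{rel}--> {head}")
# Jan --nsubj--> bere
# bere --root--> ROOT
# knjigo --obj--> bere
```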
Datasets to train Slovene annotation tools
Q3.1: Where can I get word embeddings for Slovene?
- CLARIN.SI-embed.sl is the embedding collection trained on the largest collection of Slovene textual data (Gigafida, slWaC, Janes, KAS, etc.).
- There are also collections of trained embeddings for Slovene available from SketchEngine and from fastText.
- If you want to train your own embeddings, the largest freely available collection of Slovene texts is the Slovene portion of Common Crawl.
Q3.2: What data is available for training a text normaliser for Slovene?
Q3.3: What data is available for training a part-of-speech tagger for Slovene?
The reference dataset for training a standard-language tagger is ssj500k. A silver-standard extension of ssj500k, jos1M, is also available, as are training datasets for Internet Slovene (Janes-tag) and for historical Slovene (goo300k).
Q3.4: What data is available for training a lemmatiser for Slovene?
Q3.5: What data is available for training a named entity recogniser for Slovene?
For training a named entity recogniser for standard language, ssj500k is the best resource. For training NER systems for online, non-standard texts, Janes-tag can be used. Finally, for training historical NER models, goo300k is the best resource.
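Named-entity annotations in training corpora are commonly distributed as per-token IOB labels, which most NER toolkits expect. A small sketch for collecting such labels into entity spans; the tokens and labels below are made up for illustration, and the function deliberately ignores malformed label sequences:

```python
# Group per-token IOB named-entity labels into (entity_text, type) spans.
tokens = ["Janez", "Novak", "dela", "v", "Ljubljani", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

def iob_to_spans(tokens, labels):
    spans, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):                       # new entity starts
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == etype:
            current.append(tok)                        # entity continues
        else:                                          # "O" or malformed "I-"
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:                                        # flush trailing entity
        spans.append((" ".join(current), etype))
    return spans

print(iob_to_spans(tokens, labels))
# [('Janez Novak', 'PER'), ('Ljubljani', 'LOC')]
```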