This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at helpdesk.classla@clarin.si, using the subject “FAQ_Slovene”.
The questions in this FAQ are organised into three main sections:
1. Online Slovene language resources
Q1.1: Where can I find Slovene dictionaries?
Below we list the main dictionary portals offered by CLARIN.SI partners or supported by CLARIN.SI:
- FRAN offers aggregate search over all Slovene dictionaries (general, etymological, historical, terminological, and dialectal) of the Fran Ramovš Institute of the Slovenian Language at ZRC SAZU. The Institute also offers a School Dictionary of the Slovenian Language on the Franček portal.
- A thesaurus, a collocation dictionary, a small glossary of Twitterese, and a Slovene-Hungarian dictionary are available at the CJVT infrastructure centre of the University of Ljubljana.
- Kontekst.io is a lexicon of semantically related words for Slovene, Croatian and Serbian, automatically produced on the basis of word embeddings from large corpora.
- Termania is a portal of free online dictionaries of various languages and fields, offered by the Amebis company.
- Sloleks, a Slovene morphological lexicon, the pilot Slovene online dictionary SSSJ, and the prototype Slovene lexical database LBS are among the results of the “Communication in Slovene” project, with the portal hosted at CLARIN.SI.
- sloWNet, the WordNet-based Slovenian semantic lexicon, IMP, a glossary of archaic Slovene, and jaSlo, a Japanese-Slovene learners’ dictionary, are offered by the Jožef Stefan Institute.
- Razvezani jezik, a user-generated dictionary of spoken Slovene, is offered by the Domestic Research Society.
- Numerous terminological dictionaries, such as the Slovenian-English glossary of education, the Terminological dictionary of artificial intelligence, the Terminological dictionary of tax terminology and others, can be downloaded from the repository of CLARIN.SI.
Dictionaries by other providers:
- Evroterm, a multilingual terminology database, and a list of online dictionaries, both offered by the Government of the Republic of Slovenia;
- Islovar, a terminological dictionary for the field of informatics, by the Slovene Society for Informatics;
- Wikislovar, the Slovene Wiktionary, i.e., a multilingual, openly accessible and openly editable dictionary.
Q1.2: How can I analyse Slovene corpora online?
CLARIN.SI offers access to three concordancers, which share the same set of corpora and back-end, but have different front-ends:
- CLARIN.SI Crystal noSketch Engine, an open-source variant of the well-known Sketch Engine. Instructions for its use are available here. No registration is necessary (or possible), which has some drawbacks, e.g., you cannot save your screen settings or create private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.
- CLARIN.SI Bonito noSketch Engine is the old version of noSketch Engine with a radically different user interface from Crystal. This version offers some functions that the new noSketch Engine does not, in particular retrieving query results in XML: it is enough to append the parameter “format=XML” to the query URL, as in the example below.
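A hypothetical example of such a URL (the corpus name and query are illustrative, not a guaranteed endpoint):

```
https://www.clarin.si/noske/run.cgi/first?corpname=gigafida&iquery=miza&format=XML
```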
Documentation on how to query corpora via the Sketch Engine-like interfaces is available here.
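For quick orientation, all three interfaces also accept queries in CQL (Corpus Query Language). A few illustrative queries follow; the attribute names (word, lemma, tag) reflect the usual CLARIN.SI corpus configuration and may differ between corpora:

```
[lemma="miza"]                  # all inflected forms of the lemma "miza" (table)
[tag="Ncfsg"]                   # common nouns, feminine singular genitive (MULTEXT-East MSD)
[lemma="star"][lemma="hiša"]    # forms of "star" (old) directly followed by forms of "hiša" (house)
```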
Note that the commercial Sketch Engine also offers access to several Slovene corpora, as well as some additional tools that are not available in the free noSketch Engine: tools to analyse collocations (Word sketches) and synonyms and antonyms (Thesaurus), tools to compute frequency lists of multiword expressions (N-grams), and tools to extract keywords and terms. It also allows users to create their own corpora.
Some Slovene corpora, esp. those produced within the “Communication in Slovene” project, have their own specialised web concordancers; cf. the corpora listed in Q1.3.
Q1.3: Which Slovene corpora can I analyse online?
The main reference corpus for Slovene is Gigafida (1 billion words), which you can query via its specialised interface, via Crystal noSkE, Bonito noSkE or KonText. Note that the corpus is also available in a version with (near-)duplicate paragraphs removed, cf. noSkE or KonText. A balanced subset of Gigafida is KRES (100 million tokens), which you can query via its specialised interface.
For a complete list of corpora available under CLARIN.SI concordancers, see the index for Crystal noSkE, Bonito noSkE or KonText. Below we list some of the important ones, with links to the Crystal noSketch Engine concordancer:
- a general language corpus (apart from Gigafida) is slWaC, a large corpus (900 million tokens) of Slovene texts from the Web
- specialized corpora include the corpus of academic writing KAS, the corpus of scientific publications from the Open Science portal OSS, the corpus of user-generated content (blogs, forums, comments, and tweets) Janes, the monitor news corpus Trendi, the spoken corpus GOS, the parliamentary corpora siParl and ParlaMint-SI, the Wikipedia corpus CLASSLAWiki-sl, the corpus of historical Slovene IMP, the Proverbs corpus, the corpus of 100 novels ELTeC-slv, the corpus of youth literature MAKS, and the developmental corpus ŠOLAR
- manually annotated corpora include the reference training corpus ssj500k (sampled from the Gigafida corpus), the corpus of historical Slovene goo300k (sampled from the IMP corpus), the corpus of term-annotated texts RSDO5, and two corpora of user-generated content: Janes-Norm (sampled from the Janes corpus), manually annotated with normalised word forms, and Janes-Tag (sampled from Janes-Norm), additionally annotated with morphosyntactic descriptions, lemmas, and named entities
- the meta-corpus metaFida (4 billion tokens) unites the most important publicly accessible Slovene corpora and enables uniform search across them
- parallel corpora include the multilingual DGT translation memory corpus EU DGT-UD: Slovenian, the Slovene-English corpus TRANS5, the Italian-Slovene corpus ISPAC, the French-Slovene corpus LeMonde, and the Japanese-Slovene corpus jaSlo
Furthermore, the commercial Sketch Engine includes the following Slovene corpora: the learner corpus of proofread translations Lektor, which you can also query via its specialised interface; EUR-Lex Slovenian 2/2016; EUROPARL7, a parallel corpus created from the European Parliament proceedings; and OPUS2, a parallel corpus covering 40 languages.
Q1.4: What linguistic annotation schemas are used in Slovene corpora?
Most of these corpora are annotated according to the MULTEXT-East morphosyntactic specifications for Slovene. On the level of syntax, and esp. for older corpora, the Slovene-specific SSJ annotation scheme is used. Corpora are also annotated according to the Universal Dependencies guidelines. Named entities are often annotated following the Janes NE guidelines for Slovene.
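As a small illustration of how the two morphosyntactic schemes differ, consider the word form “mize” (genitive singular of “miza”, table); the analyses below are illustrative:

```
MULTEXT-East MSD:        Ncfsg   (Noun, common, feminine, singular, genitive)
Universal Dependencies:  NOUN    Case=Gen|Gender=Fem|Number=Sing
```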
Q1.5: Where can I download Slovene resources?
The main point for archiving and downloading Slovene language resources is the repository of CLARIN.SI.
In addition to the resources mentioned above and below, the repository offers:
- manually annotated corpora and datasets, including the corpus of comma placement Vejica 1.3, the dataset of idiomatic expressions SloIE, the corpora of metaphorical expressions KOMET and G-KOMET, the bilingual terminology extraction dataset KAS-biterm, the sentiment annotated news corpus SentiNews, the Semantic change detection datasets for Slovenian, the Tweet code-switching corpus Janes-Preklop, the offensive language dataset FRENK, annotated for different types of socially unacceptable discourse, and the post-edited and error annotated machine translation corpus PErr;
- other parallel corpora, including the Slovene-English parallel corpora MaCoCu-sl-en, slenWaC and RSDO4 1.0, and the Slovene-English parallel corpus of idiomatic text ParaDiom;
- other corpora and datasets, including a large web corpus MaCoCu-sl with 1.8 billion words, the linguistically annotated corpus of parliamentary debates ParlaMint.ana, the sentiment annotated Twitter corpora, the Twitter dataset with automatically assigned hate speech labels, the corpus of textbooks ccUčbeniki, the corpus of 1968 literature Maj68, the Slovenian datasets for contextual synonym and antonym detection, the text simplification dataset SloTS, the natural language inference dataset SI-NLI, and the corpus for general relation extraction SloREL;
- wordlists and other lexical resources, including the core vocabulary for Slovenian as L2, a lexicon of emoji characters with automatically assigned sentiment, and the LiLaH emotion lexicon.
2. Tools to annotate Slovene texts
Q2.1: How can I perform basic linguistic processing of my Slovene texts?
The state-of-the-art CLASSLA pipeline provides processing of standard and non-standard (Internet) Slovene on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing and named entity recognition. The CLASSLA pipeline uses two tokenisers: the rule-based tokeniser Obeliks4J for the standard Slovene pipeline and reldi-tokeniser for all other cases. Off-the-shelf models are also available for lemmatisation of standard and non-standard Slovene, and for part-of-speech tagging of standard and non-standard Slovene.
The documentation for the installation and use of the pipeline is available here. Furthermore, the CLASSLA pipeline offers some additional features, namely the use of the Slovene-specific dependency parsing system, an inflectional lexicon, and pretokenised input, which are documented here.
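As an illustration, a minimal sketch of running the pipeline from Python (the example sentence is arbitrary; consult the official documentation for the exact installation steps and options):

```python
import classla

# One-off download of the standard Slovene models
classla.download('sl')

# Build a pipeline with the full set of annotation processors
nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse,ner')

doc = nlp("France Prešeren je bil slovenski pesnik.")
# CoNLL-U output with morphosyntax, lemmas, dependencies and named entities
print(doc.to_conll())
```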
In addition, tokenisation, part-of-speech tagging, and lemmatisation are provided by the CLARIN.SI service ReLDIanno. The documentation for using the service is available here. It can be used via a web interface or as a web service. You can also install the same tools locally, namely the tokeniser and the part-of-speech tagger and lemmatiser.
Q2.2: How can I standardise my texts prior to further processing?
Currently, the only online text normalisation tool available through the CLARIN.SI services (ReLDIanno) is the REDI diacritic restorer. Its usage is documented here. You can also download the REDI diacritic restorer and install and use it locally.
For word-level normalisation of e.g. historical and user-generated Slovene texts, you can download and install the CSMTiser text normaliser.
Q2.3: How can I annotate my texts for named entities?
Named entity recognition is provided by the CLASSLA pipeline, which also offers off-the-shelf models for standard and non-standard Slovene. In addition, online NER is available via the CLARIN.SI service ReLDIanno. You can also download the janes-ner tool.
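As a minimal sketch with the CLASSLA pipeline (assuming the standard models have been downloaded as in Q2.1), named entities can be accessed per token:

```python
import classla

# NER only needs the tokeniser in addition to the NER model itself
nlp = classla.Pipeline('sl', processors='tokenize,ner')
doc = nlp("Ljubljana je glavno mesto Slovenije.")

for sentence in doc.sentences:
    for token in sentence.tokens:
        # BIO-style labels, e.g. "Ljubljana B-LOC", "Slovenije B-LOC"
        print(token.text, token.ner)
```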
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Slovene texts in multiple ways:
- by using the state-of-the-art CLASSLA pipeline (Universal Dependencies formalism), which also offers an off-the-shelf model; a usage sketch follows this list
- by using the CLARIN.SI service ReLDIanno (Universal Dependencies formalism)
- by using the UDPipe tool, which has off-the-shelf models for many languages, Slovene included (Universal Dependencies formalism)
- by using the Slovene Parser (Razčlenjevalnik) tool (Slovene-specific formalism)
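As a sketch of the first option (assuming the standard CLASSLA models are installed, cf. Q2.1), the resulting dependency trees can be inspected word by word:

```python
import classla

nlp = classla.Pipeline('sl', processors='tokenize,pos,lemma,depparse')
doc = nlp("Muca sedi na okenski polici.")

for sentence in doc.sentences:
    for word in sentence.words:
        # word.head is a 1-based index into the sentence; 0 means the root
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text} --{word.deprel}--> {head}")
```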
3. Datasets to train Slovene annotation tools
Q3.1: Where can I get word embeddings or pre-trained language models for Slovene?
- The embeddings trained on the largest collection of Slovene textual data (Gigafida, slWaC, Janes, KAS, etc.) are gathered in the CLARIN.SI-embed.sl embedding collection.
- There are also collections of pre-trained embeddings for Slovene available from Sketch Engine and from fastText.
- If you want to train your own embeddings, the largest freely available collection of Slovene texts is the Slovene portion of Common Crawl.
You can also use the Slovene RoBERTa-based model SloBERTa, a state-of-the-art model that represents words/tokens as contextually dependent embeddings. It allows you to extract an embedding for every word occurrence, which can then be used to train a model for an end task. The scripts and programs used for data preparation and for training the model are available here.
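As an illustration, contextual embeddings can be extracted with the Hugging Face transformers library. A minimal sketch follows; the model identifier EMBEDDIA/sloberta is our assumption about the published checkpoint and should be checked against the repository entry:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model ID of the published SloBERTa checkpoint (verify before use)
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModel.from_pretrained("EMBEDDIA/sloberta")

inputs = tokenizer("Ljubljana je glavno mesto Slovenije.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per subword token: (batch, tokens, hidden_size)
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```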
Q3.2: What data is available for training a text normaliser for Slovene?
For training text normalisers for Internet Slovene, the Janes-Norm dataset can be used. For normalising historical data, the goo300k dataset should be used.
Q3.3: What data is available for training a part-of-speech tagger for Slovene?
The reference dataset for training a standard-language tagger is ssj500k. A silver-standard extension of ssj500k, jos1M, is also available, as are training datasets for Internet Slovene (Janes-Tag) and for historical Slovene (goo300k).
You can also use the CLASSLA pipeline in combination with the CLARIN.SI embeddings and the training dataset ssj500k to train and evaluate your own part-of-speech tagger. The documentation is available here.
Q3.4: What data is available for training a lemmatiser for Slovene?
Lemmatisers can be trained on the tagger training data (ssj500k, jos1M, Janes-Tag, goo300k; see Q3.3 for details) and/or on the inflectional lexicon Sloleks.
For training your own lemmatiser for standard or non-standard Slovene, you can use the CLASSLA pipeline, which uses an external lexicon (Sloleks) for lemmatisation. The documentation is available here.
Q3.5: What data is available for training a named entity recogniser for Slovene?
For training a named entity recogniser for standard language, ssj500k is the best resource. For training NER systems for online, non-standard texts, Janes-Tag can be used. Finally, for training historical NER models, goo300k is the best resource.
The CLASSLA pipeline allows you to train your own named entity recogniser as well. The documentation is available here.
Q3.6: What data is available for training a syntactic parser for Slovene?
If you want to follow the Universal Dependencies formalism for dependency parsing, the best location for obtaining training data is the Universal Dependencies repository.
For training parsers by following the Slovene-specific formalism, the ssj500k dataset should be used.
You can also use the CLASSLA pipeline to train your own parser. The documentation is available here.