This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at firstname.lastname@example.org, with the subject “FAQ_Croatian”.
The questions in this FAQ are organised into three main sections:
1. Online Croatian language resources
Q1.1: Where can I find Croatian dictionaries?
Below we list the main lexical resources:
- Hrvatski Jezični Portal offers search over the largest dictionary database of the Croatian language (the Novi Liber dictionary database)
- The Institute for Croatian language and linguistics offers the Spelling dictionary, the Dictionary of Phrasemes, the Valency Dictionary, the Collocation Dictionary, and the Croatian Terminology Portal, which offers central access to various terminological dictionaries and a list of other terminology resources
- The Croatian Old Dictionary Portal allows you to search through digitised dictionaries from the 16th to the 19th century
- The Miroslav Krleža Institute of Lexicography offers access to a series of on-line lexicons
- Kontekst.io is a lexicon of semantically related words, automatically produced from word embeddings trained on large corpora
- Termania is a portal of free online dictionaries, offered by the Amebis company
- Wječnik, the Croatian Wiktionary, is a multilingual, openly accessible and openly editable dictionary
- CroWN is a lexical database for Croatian and CroDeriV is a morphological database of Croatian verbs
- hrLex is the largest inflectional lexicon of the Croatian language, consisting of 186,743 lexemes and 6,428,577 entries; it is searchable through the CLARIN.SI web interface (Anonymous login, Lexicon)
Q1.2: How can I analyse Croatian corpora online?
CLARIN.SI offers access to three concordancers, which share the same set of corpora and back-end, but have different front-ends:
- CLARIN.SI Crystal noSketch Engine, an open-source variant of the well-known Sketch Engine. Instructions for its use are available here. No registration is necessary or possible, which has some drawbacks, e.g., not being able to save your screen settings or make private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.
- CLARIN.SI Bonito noSketch Engine is the old version of noSketch Engine with a radically different user interface from Crystal. This version offers some functions that the new noSketch Engine does not; in particular, query results can be retrieved as XML by adding the parameter “format=XML” to the end of the query URL.
Documentation on how to query corpora via the SketchEngine-like interfaces is available here.
Note that the commercial Sketch Engine also offers access to several Croatian language corpora, as well as some additional tools that are not accessible on the free NoSketch Engine, including the tools to analyse collocations (Word sketches), synonyms and antonyms (Thesaurus), the tools to compute frequency lists of multiword expressions (N-grams) and to extract keywords and terms. It also allows users to create their own corpora.
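The “format=XML” parameter described above for the Bonito noSketch Engine can also be appended programmatically. The following Python sketch is only an illustration: the host name and the query parameters in the example URL are hypothetical placeholders, and the real query URL should be copied from the browser after running a query in Bonito.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_xml_format(query_url):
    """Return query_url with format=XML as its last query parameter."""
    parts = urlsplit(query_url)
    # Keep all existing parameters except a previous "format" value.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    params.append(("format", "XML"))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical query URL, used only to show the transformation.
example = "https://concordancer.example.org/bonito/run.cgi/first?corpname=hrwac&iquery=more"
print(with_xml_format(example))
```

Note that the helper re-encodes the existing parameters, so the escaping of special characters in the returned URL may look different from the original while denoting the same query.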
Q1.3: Which Croatian corpora can I analyse online?
For a complete list of corpora available under CLARIN.SI concordancers, see the index for Crystal noSkE, Bonito noSkE or KonText. Below we list the Croatian ones, with links to the Crystal noSketch Engine concordancer:
- general language corpora are the web corpus hrWaC (1.4 billion tokens); the Riznica Croatian Language Corpus (100 million tokens) of the Institute for Croatian language and linguistics, which consists of literary works and newspaper texts and can also be queried via its own specialized interface; and the Croatian National Corpus (HNK) of the Institute of Linguistics
- specialized corpora include the parliamentary corpus ParlaMint-HR, the Wikipedia corpora for Croatian (CLASSLAWiki-hr) and Serbo-Croatian (CLASSLAWiki-sh), the corpus of news portals ENGRI, and the corpus of tweets Tweet-hr
- manually annotated corpora include the training corpus of standard language hr500k, the corpora of non-professional written language Raput-cln (speakers with language disorders) and Raput-ncln (typical speakers), and the training corpus of computer-mediated communication ReLDI-hr with manually normalised (standardised), morphosyntactically tagged and lemmatised words and named entities
- parallel corpora include the multilingual DGT translation memory corpus EU DGT-UD: Croatian
In addition to these, the Croatian Spoken Language Corpus is available through TalkBank. TalkBank also offers access to the Kovačević Corpus, a small language development corpus of three participants. It is the Croatian part of CHILDES, a collection of comparable corpora that consists of transcripts of child language for 24 languages. Furthermore, the CroLTeC corpus, a learner corpus of Croatian, can be queried via the TeiTok interface.
Q1.4: What linguistic annotation schemas are used in Croatian corpora?
Most of these corpora are annotated according to the MULTEXT-East morphosyntactic specifications. The more recent ones use the Version 6 specifications for the Serbo-Croatian macrolanguage. More recent corpora also use the Universal Dependencies project annotation scheme, in particular that for Croatian and Serbian. Named entities are annotated via the Janes NE guidelines.
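MULTEXT-East morphosyntactic descriptions (MSDs) are positional tags: the first character gives the part of speech and each following character encodes one attribute. The Python sketch below decodes noun MSDs only, as an illustration of the principle. The attribute order and value tables are a simplified subset written for this example and should be checked against the Version 6 specifications before being relied on.

```python
# Attribute positions for nouns (after the leading "N"), assumed here to be
# Type, Gender, Number, Case, Animate; verify against the specification.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    }),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(msd):
    """Expand a noun MSD such as "Ncfsn" into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    features = {"Category": "Noun"}
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        features[name] = values.get(code, f"unknown({code})")
    return features


print(decode_noun_msd("Ncfsn"))
```

Other parts of speech have their own attribute lists, so a full decoder would need one table per category.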
Q1.5: Where can I download Croatian resources?
The main point for archiving and downloading Croatian language resources is the repository of CLARIN.SI.
In addition to the resources mentioned above and below, the repository offers:
- manually annotated corpora and datasets, including the Sentiment Annotated Dataset of Croatian News, the sentiment annotated corpus of parliamentary debates ParlaSent-BCS, and the offensive language dataset FRENK, annotated for different types of socially unacceptable discourse
- other corpora and datasets, including the largest Croatian corpus – the web corpus MaCoCu-hr with 2.3 billion words, the linguistically annotated corpus of parliamentary debates ParlaMint.ana, the automatic speech recognition training dataset ParlaSpeech-HR, the 24sata news article archive and news comment dataset, the LiLaH emotion lexicon, the Twitter corpus, the automatically constructed multiword lexicon hrMWELex, the text collection for training the BERTić transformer model BERTić-data, the Keyword extraction dataset, the news dataset SETimes.HBS and the Twitter dataset Twitter-HBS for discriminating between Bosnian, Croatian, Montenegrin and Serbian, and the Croatian-English parallel corpora MaCoCu-hr-en, hrenWaC and the Tourism Corpus.
Another point where you can find Croatian resources is the MetaShare repository, which includes the sentiment lexicon CroSentilex, the valency lexicon CROVALLEX, and the South-East European Parallel Corpus SETimes Corpus.
2. Tools to annotate Croatian texts
Q2.1: How can I perform basic linguistic processing of my Croatian texts?
The state-of-the-art CLASSLA pipeline provides processing of standard and non-standard (Internet) Croatian on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Croatian, the CLASSLA pipeline uses the rule-based reldi-tokeniser. Off-the-shelf models are also available for lemmatisation and for part-of-speech tagging of both standard and non-standard Croatian.
The documentation for the installation and use of the pipeline is available here.
In addition to this, tokenisation, part-of-speech tagging, and lemmatisation are also provided by the CLARIN.SI service ReLDIanno. The documentation for using the service is available here. It can be used via a web interface or as a web service. You can also install the same tools locally, namely the tokeniser and the part-of-speech tagger and lemmatiser.
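As a rough illustration of local use of the CLASSLA pipeline, the Python sketch below wraps the calls into a function and adds a small helper that reads the token columns from CoNLL-U output. It assumes the `classla` package is installed (`pip install classla`) and that models can be downloaded; the function names and the sample sentence are made up for this example, and the sample CoNLL-U text is hand-written rather than real pipeline output. The pipeline documentation remains the authoritative reference.

```python
def annotate_croatian(text, nonstandard=False):
    """Run the CLASSLA pipeline on text and return its CoNLL-U output.

    The first call downloads the Croatian models, which needs network access.
    """
    import classla  # imported lazily so the helper below works without it

    kwargs = {"type": "nonstandard"} if nonstandard else {}
    classla.download("hr", **kwargs)
    nlp = classla.Pipeline("hr", **kwargs)
    return nlp(text).to_conll()


def read_tokens(conllu):
    """Return (form, lemma, upos, xpos) tuples from CoNLL-U text."""
    tokens = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, and multiword-token ranges such as "1-2".
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append((cols[1], cols[2], cols[3], cols[4]))
    return tokens


# Hand-written sample so that read_tokens can be tried offline.
SAMPLE = "\n".join([
    "# text = Zagreb je grad.",
    "\t".join(["1", "Zagreb", "Zagreb", "PROPN", "Npmsn", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["2", "je", "biti", "AUX", "Var3s", "_", "3", "cop", "_", "_"]),
    "\t".join(["3", "grad", "grad", "NOUN", "Ncmsn", "_", "0", "root", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "Z", "_", "3", "punct", "_", "_"]),
])

for form, lemma, upos, xpos in read_tokens(SAMPLE):
    print(form, lemma, upos, xpos)
```

With the package installed, `read_tokens(annotate_croatian("Zagreb je glavni grad Hrvatske."))` would return the annotated tokens of that sentence; passing `nonstandard=True` selects the models for Internet Croatian.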
Q2.2: How can I standardize my texts prior to further processing?
- Currently, the only on-line text normalisation tool available through the CLARIN.SI services (ReLDIanno) is the REDI diacritic restorer. Its usage is documented here. You can also download it, install it and use it locally.
- For word-level normalisation of user-generated Croatian texts, you can download and install the CSMTiser text normaliser.
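To make the task of diacritic restoration concrete, here is a toy Python sketch based on a plain lexicon lookup. It does not reproduce the method used by the REDI tool; the three-word lexicon, the function names, and the mapping of “đ” to “d” are assumptions made for this illustration only.

```python
# Map Croatian letters with diacritics to their stripped counterparts.
STRIP = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")


def build_index(lexicon):
    """Map each stripped form to the first lexicon word that produces it."""
    index = {}
    for word in lexicon:
        index.setdefault(word.translate(STRIP), word)
    return index


def restore(text, index):
    """Replace whitespace-separated tokens with their lexicon form, if any."""
    return " ".join(index.get(token, token) for token in text.split())


# Tiny illustrative lexicon; a real one would be far larger.
LEXICON = ["zašto", "kuća", "češće"]
INDEX = build_index(LEXICON)
print(restore("zasto je kuca prazna", INDEX))
```

A lookup like this cannot resolve ambiguity: “kuca” is itself a valid Croatian word form, so choosing between “kuca” and “kuća” requires context, which is why dedicated tools are preferable for real texts.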
Q2.3: How can I annotate my texts for named entities?
Named entity recognition is provided by the CLASSLA pipeline, which also offers off-the-shelf models for standard and non-standard Croatian. In addition to this, on-line NER is available via the CLARIN.SI service ReLDIanno. You can also download the janes-ner tool.
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Croatian texts, following the Universal Dependencies formalism, in multiple ways:
- by using the state-of-the-art CLASSLA pipeline, which also offers an off-the-shelf model
- by using the CLARIN.SI service ReLDIanno
- by using the UDPipe tool, which has off-the-shelf models for many languages, Croatian included
3. Datasets to train Croatian annotation tools
Q3.1: Where can I get word embeddings or pre-trained language models for Croatian?
- The embeddings trained on the largest collection of Croatian textual data (hrWaC, Riznica, 24sata newspaper texts and comments, etc.) are available as the CLARIN.SI-embed.hr embedding collection.
- There are also collections of trained embeddings for Croatian available from fastText.
- If you want to train your own embeddings, the largest freely available collection of Croatian texts is the BERTić-data text collection.
You can also use a transformer language model BERTić, a state-of-the-art model representing words/tokens as contextually dependent word embeddings. It allows you to extract word embeddings for every word occurrence, which can then be used in training a model for an end task.
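Extracting one vector per word occurrence from a transformer model usually means averaging the vectors of the subword pieces that make up each word. The Python sketch below shows that pooling step on plain lists, together with an optional function that obtains the subword vectors through the Hugging Face `transformers` library. The model identifier `classla/bcms-bertic` and both function names are assumptions made for this example; check the model's own documentation for the correct identifier and recommended usage.

```python
def pool_subwords(vectors, word_ids):
    """Average subword vectors sharing a word index; None marks special tokens."""
    sums, counts = {}, {}
    for vector, word_id in zip(vectors, word_ids):
        if word_id is None:
            continue
        if word_id not in sums:
            sums[word_id] = list(vector)
            counts[word_id] = 1
        else:
            sums[word_id] = [a + b for a, b in zip(sums[word_id], vector)]
            counts[word_id] += 1
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


def embed_words(words, model_name="classla/bcms-bertic"):
    """Return one contextual vector per input word (needs torch and transformers)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    encoded = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]
    return pool_subwords(hidden.tolist(), encoded.word_ids(0))


# Toy data: word 0 is split into two subword pieces, word 1 into one.
toy_vectors = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [9.0, 9.0]]
toy_word_ids = [None, 0, 0, 1, None]
print(pool_subwords(toy_vectors, toy_word_ids))
```

Mean pooling is only one common choice; taking the first subword vector of each word is another, and which works better depends on the end task.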
Q3.2: What data is available for training a text normaliser for Croatian?
For training text normalisers for Internet Croatian, you can use the ReLDI-NormTagNER-hr dataset, a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian.
Q3.3: What data is available for training a part-of-speech tagger for Croatian?
You can use the CLASSLA pipeline in combination with the CLARIN.SI embeddings and the training dataset hr500k to train and evaluate your own part-of-speech tagger. The documentation is available here.
Q3.4: What data is available for training a lemmatiser for Croatian?
For training your own lemmatiser for standard and non-standard Croatian, you can use the CLASSLA pipeline, which uses the external lexicon for lemmatisation (hrLex). The documentation is available here.
Q3.5: What data is available for training a named entity recogniser for Croatian?
Q3.6: What data is available for training a syntactic parser for Croatian?
If you require additional annotation layers, e.g., for multi-task learning, the hr500k dataset should be used.