This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please let us know at helpdesk.classla@clarin.si, with the subject “FAQ_Croatian”.
The questions in this FAQ are organised into three main sections:
1. Online Croatian language resources
Q1.1: Where can I find Croatian dictionaries?
Below we list the main lexical resources:
- Hrvatski Jezični Portal offers search over the largest dictionary database of the Croatian language (the Novi Liber dictionary database)
- The Spelling dictionary of the Institute for Croatian language and linguistics consists of spelling rules and suggestions for the Croatian language
- The Croatian Terminology Portal of the Institute for Croatian language and linguistics offers central access to various terminological dictionaries
- The Miroslav Krleža Institute of Lexicography offers access to a series of on-line lexicons
- hrLex is the largest inflectional lexicon of the Croatian language, consisting of 186,743 lexemes and 6,428,577 entries; it is searchable through the CLARIN.SI web interface (Anonymous login, Lexicon)
Q1.2: How can I analyse Croatian corpora online?
CLARIN.SI offers access to two concordancers, which share the same set of (Croatian) corpora and back-end, but have different front-ends:
- NoSketch Engine, an open-source variant of the well-known Sketch Engine. No registration is necessary or possible, which also has drawbacks, e.g. you cannot save your screen settings or create private subcorpora.
- KonText, with a somewhat different user interface. Basic functionality is provided without logging in, but to use the more advanced functionality you need to log in via AAI through your identity provider.
Documentation on how to query corpora via the Sketch Engine-like interfaces is available here.
Note that the commercial Sketch Engine also offers access to several Croatian language corpora. Furthermore, for researchers in the EU, access to Sketch Engine is free for non-commercial purposes in 2018-2022.
Q1.3: Which Croatian corpora can I analyse online?
These are the main general language corpora:
- The largest corpus of the Croatian language is the Croatian web corpus hrWaC (1.4 billion words), which you can query via noSkE or KonText.
- The Croatian Language Corpus of the Institute for Croatian language and linguistics contains 100 million tokens and consists of literary works and newspaper texts. It is available for search through noSkE or KonText.
- The Croatian National Corpus of the Institute of Linguistics is available for search through the noSkE interface.
The main specialised corpora are the following:
- The only learner corpus of Croatian is the CroLTeC corpus, which is searchable through the TeiTok interface.
- For European legislation, the Croatian portion of the DGT-UD corpus can be queried through noSkE or KonText.
- The only Croatian corpus of spoken language is the Croatian Spoken Language Corpus, available through TalkBank.
- There is a small language development corpus of three participants, the Kovačević corpus, available through TalkBank.
Finally, the main manually annotated corpora are the following:
- The training corpus of standard language (hr500k) is available through noSkE or KonText.
- The training corpus of computer-mediated communication (ReLDI-NormTagNER-hr) is available through noSkE or KonText.
Q1.4: What linguistic annotation schemas are used in Croatian corpora?
Most of these corpora are annotated according to the MULTEXT-East morphosyntactic specifications. The more recent ones use the Version 6 specifications for the Serbo-Croatian macrolanguage. More recent corpora also use the Universal Dependencies project annotation scheme, in particular that for Croatian and Serbian. Named entities are annotated according to the Janes NE guidelines.
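In CoNLL-U-formatted releases of such corpora, these annotation layers typically sit side by side on each token line: the Universal Dependencies part of speech in the UPOS column, the MULTEXT-East tag in the XPOS column, and the Universal Dependencies features in the FEATS column. The Python sketch below splits one such line into the ten standard CoNLL-U columns. The function name is hypothetical, and the sample token line is a hand-written illustration, not a line copied from any of the corpora.

```python
# Column names of the CoNLL-U format, in order.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 tab-separated columns, got {len(columns)}")
    token = dict(zip(CONLLU_FIELDS, columns))
    # FEATS is either "_" (no features) or "Name=Value" pairs joined by "|".
    if token["feats"] == "_":
        token["feats"] = {}
    else:
        token["feats"] = dict(
            pair.split("=", 1) for pair in token["feats"].split("|")
        )
    return token


# Hand-written illustrative line (an assumption, not corpus data): the
# genitive singular of "kuća" (house), with a MULTEXT-East style tag in
# XPOS and Universal Dependencies features in FEATS.
sample = "2\tkuće\tkuća\tNOUN\tNcfsg\tCase=Gen|Gender=Fem|Number=Sing\t1\tnmod\t_\t_"
token = parse_token_line(sample)
print(token["upos"], token["xpos"], token["feats"]["Case"])
```

Keeping both tag sets in the parsed token makes it easy to filter by either scheme, depending on which one a downstream tool expects.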
Q1.5: Where can I download Croatian resources?
The main point for archiving and downloading Croatian language resources is the repository of CLARIN.SI.
Another point for downloading resources in Croatian is the MetaShare repository.
2. Tools to annotate Croatian texts
Q2.1: How can I perform basic linguistic processing of my Croatian texts?
- Tokenisation, part-of-speech tagging and lemmatisation of your texts can be done via the CLARIN.SI services. The documentation for using the services, either via a web interface or as a web service, is available here. You can also install the same tools locally, namely the tokenizer and the part-of-speech tagger and lemmatizer.
Q2.2: How can I standardize my texts prior to further processing?
- Currently, the only on-line text normalization tool available through the CLARIN.SI services is the REDI diacritic restorer. The usage of the CLARIN.SI services is documented here. You can also download the REDI diacritic restorer, install it and use it locally.
- For word-level normalisation of user-generated Croatian texts you can download and install the CSMTiser text normalizer.
Q2.3: How can I annotate my texts for named entities?
- On-line NER is available via the CLARIN.SI services documented here. You can also download this NER tool and use it locally.
Q2.4: How can I syntactically parse my texts?
You can syntactically parse Croatian texts in multiple ways:
- by using the CLARIN.SI services (Universal Dependencies formalism)
- by using the UDPipe tool which has off-the-shelf models for many languages, Croatian included (Universal Dependencies formalism)
3. Datasets to train Croatian annotation tools
Q3.1: Where can I get word embeddings for Croatian?
The embeddings trained on the largest collection of Croatian textual data (hrWaC, Riznica, 24sata newspaper texts and comments etc.) are available as the CLARIN.SI-embed.hr embedding collection.
There are also collections of trained embeddings for Croatian available from fastText.
If you want to train your own embeddings, the largest freely available collection of Croatian texts is the hrWaC corpus.
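Pretrained embeddings are often distributed in the plain-text word2vec format that fastText uses for its `.vec` files: a header line with the vocabulary size and the dimensionality, then one word per line followed by its vector components. The following is a minimal Python sketch of reading such a file and comparing two words by cosine similarity. It assumes the file you downloaded uses this text format, so check the documentation of the specific collection first. The function names and the three-word toy vectors are made up for illustration.

```python
import io
import math


def load_text_vectors(handle, limit=None):
    """Read word vectors from an open text-format (.vec style) file."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimensionality>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real embeddings file (values are invented).
toy = io.StringIO("3 2\nmore 1.0 0.0\njezero 0.9 0.1\nknjiga 0.0 1.0\n")
vectors = load_text_vectors(toy)
print(cosine(vectors["more"], vectors["jezero"]))
print(cosine(vectors["more"], vectors["knjiga"]))
```

For a real file, pass `open(path, encoding="utf-8")` instead of the in-memory toy data, and consider setting `limit` so that only the most frequent words are loaded into memory.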
Q3.2: What data is available for training a text normaliser for Croatian?
For training text normalisers for Internet Croatian, the ReLDI-NormTagNER-hr dataset can be used.
Q3.3: What data is available for training a part-of-speech tagger for Croatian?
The reference dataset for training a standard tagger is hr500k. There is also the ReLDI-NormTagNER-hr training dataset of Internet Croatian.
Q3.4: What data is available for training a lemmatiser for Croatian?
Lemmatisers can be trained on the tagger training data (hr500k, ReLDI-NormTagNER-hr, see Q3.3 for details), on the inflectional lexicon hrLex, or on both.
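The simplest way to use an inflectional lexicon for lemmatisation is a direct lookup of the word form together with its morphosyntactic tag, falling back to the surface form when the pair is not covered. Below is a minimal Python sketch of that idea. It assumes a tab-separated file whose first three columns are word form, lemma and MSD tag; this column layout is an assumption and should be verified against the actual hrLex release. The function names and the toy entries are illustrative only.

```python
import io


def load_lexicon(handle):
    """Build a (lowercased form, MSD) -> lemma lookup table."""
    lexicon = {}
    for line in handle:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 3:
            continue  # skip blank or incomplete lines
        form, lemma, msd = columns[:3]
        # Keep the first lemma seen for each (form, MSD) pair.
        lexicon.setdefault((form.lower(), msd), lemma)
    return lexicon


def lemmatise(form, msd, lexicon):
    """Return the lexicon lemma, or the form itself if the pair is unknown."""
    return lexicon.get((form.lower(), msd), form)


# Toy entries in the assumed form / lemma / MSD layout (not real lexicon data).
toy = io.StringIO(
    "kuće\tkuća\tNcfsg\n"
    "kuće\tkuća\tNcfpn\n"
    "knjigama\tknjiga\tNcfpd\n"
)
lexicon = load_lexicon(toy)
print(lemmatise("Kuće", "Ncfsg", lexicon))
print(lemmatise("stolovima", "Ncmpd", lexicon))
```

A lookup like this only covers forms listed in the lexicon, which is why it is usually combined with a model trained on the tagger training data for unseen words.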
Q3.5: What data is available for training a named entity recogniser for Croatian?
For training a named entity recogniser for standard language, hr500k is the best resource. For training NER systems for online, non-standard texts, ReLDI-NormTagNER-hr can be used.
Q3.6: What data is available for training a syntactic parser for Croatian?
If you want to follow the Universal Dependencies formalism for dependency parsing, the best location for obtaining training data is the Universal Dependencies repository.
If you require additional annotation layers, e.g., for multi-task learning, the hr500k dataset should be used.
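Universal Dependencies treebanks are distributed in the CoNLL-U format, where sentences are separated by blank lines, comment lines start with `#`, and each token line carries its head index and dependency relation in columns 7 and 8. As a rough Python sketch of preparing parser training data, the code below reads such a file and yields each sentence as a list of (form, head, relation) triples, skipping multiword-token ranges and empty nodes. The function name and the short sample sentence are illustrative, not taken from a treebank.

```python
import io


def read_dependencies(handle):
    """Yield each sentence as a list of (form, head index, relation) triples."""
    sentence = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        # IDs like "1-2" (multiword tokens) and "1.1" (empty nodes) carry
        # no HEAD value in the basic tree, so they are skipped here.
        if not columns[0].isdigit():
            continue
        sentence.append((columns[1], int(columns[6]), columns[7]))
    if sentence:
        yield sentence


# Hand-written sample ("The dog sleeps."), used only for illustration.
sample = io.StringIO(
    "# text = Pas spava.\n"
    "1\tPas\tpas\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspava\tspavati\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)
for sent in read_dependencies(sample):
    print(sent)
```

A head index of 0 marks the root of the sentence; all other indices point to the ID of the governing token within the same sentence.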