{"id":3566,"date":"2019-03-20T13:36:01","date_gmt":"2019-03-20T13:36:01","guid":{"rendered":"http:\/\/www.clarin.si\/info\/?page_id=3566"},"modified":"2025-01-08T08:10:10","modified_gmt":"2025-01-08T08:10:10","slug":"faq4croatian","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/faq4croatian\/","title":{"rendered":"FAQ for Croatian language resources and technologies"},"content":{"rendered":"<p>This FAQ is part of the documentation of the <a href=\"..\/\">CLASSLA<\/a> CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please do let us know on <a href=\"mailto:helpdesk.classla@clarin.si?subject=FAQ_Croatian\">helpdesk.classla@clarin.si<\/a>, Subject &#8220;FAQ_Croatian&#8221;.<\/p>\n<p>The questions in this FAQ are organised into three main sections:<\/p>\n\n<h2 id=\"existing\">1. Online Croatian language resources<\/h2>\n<h4 id=\"q11\">Q1.1: Where can I find Croatian dictionaries?<\/h4>\n<p>Below we list the main lexical resources:<\/p>\n<div id=\"tab-tab-2860-0-0-6-2860-0\" class=\"tab-content\">\n<ul>\n<li><a href=\"http:\/\/hjp.znanje.hr\" target=\"_blank\" rel=\"noopener\">Hrvatski Jezi\u010dni Portal<\/a>\u00a0offers search over the largest dictionary database of Croatian language (the Novi Liber dictionary database)<\/li>\n<li>The <a href=\"http:\/\/ihjj.hr\/\" target=\"_blank\" rel=\"noopener\">Institute for Croatian language and linguistics<\/a> offers the <a href=\"http:\/\/pravopis.hr\/pravila\/\" target=\"_blank\" rel=\"noopener\">Spelling dictionary<\/a>, the <a href=\"http:\/\/frazemi.ihjj.hr\/\" target=\"_blank\" rel=\"noopener\">Dictionary of Phrasemes<\/a>, the <a href=\"http:\/\/rjecnik.hr\/\" target=\"_blank\" rel=\"noopener\">School Dictionary of Croatian language<\/a>, the <a href=\"http:\/\/valencije.ihjj.hr\/\" target=\"_blank\" rel=\"noopener\">Valency Dictionary<\/a>, the <a href=\"http:\/\/ihjj.hr\/kolokacije\/\" target=\"_blank\" rel=\"noopener\">Collocation 
Dictionary<\/a>, the <a href=\"http:\/\/osobno-ime.hr\/\" target=\"_blank\" rel=\"noopener\">Dictionary of Croatian First Names<\/a>, the <a href=\"https:\/\/metanet.hr\/\" target=\"_blank\" rel=\"noopener\">Croatian Metaphor Repository MetaNet.HR<\/a>, the database of semantic frames in the field of aviation <a href=\"https:\/\/airframe.jezik.hr\/\" target=\"_blank\" rel=\"noopener\">AirFrame<\/a>, the <span id=\"gmail-Q11_Where_can_I_find_Croatian_dictionaries\"><a href=\"http:\/\/dublete.jezik.hr\/index\/\" target=\"_blank\" rel=\"noopener\">Database of Croatian Morphological Doublets<\/a>, the\u00a0<a href=\"https:\/\/retrogram.jezik.hr\/trazilica\/\" target=\"_blank\" rel=\"noopener\">Portal of Croatian Grammars from the Pre-Illyrian Period<\/a>, <\/span>and the <a href=\"http:\/\/nazivlje.hr\" target=\"_blank\" rel=\"noopener\">Croatian Terminology Portal<\/a>, which offers central access to various terminological dictionaries and a <a href=\"http:\/\/nazivlje.hr\/english\/page\/other-terminology-sources\/9\/\" target=\"_blank\" rel=\"noopener\">list of other terminology resources<\/a><\/li>\n<li>The <a href=\"http:\/\/crodip.ffzg.hr\/default_e.aspx\" target=\"_blank\" rel=\"noopener\">Croatian Old Dictionary Portal<\/a> allows you to search through digitised dictionaries from the 16th to the 19th century<\/li>\n<li>The <a href=\"https:\/\/www.lzmk.hr\" target=\"_blank\" rel=\"noopener\">Miroslav Krle\u017ea Institute of Lexicography<\/a> offers access to a series of <a href=\"https:\/\/www.lzmk.hr\/e-leks\/online-izdanja\" target=\"_blank\" rel=\"noopener\">on-line lexicons<\/a>, including the <a href=\"https:\/\/egzonimi.lzmk.hr\/\" target=\"_blank\" rel=\"noopener\">Dictionary of Croatian Exonyms<\/a><\/li>\n<li>The Lexonomy portal offers access to a <a href=\"https:\/\/lexonomy.elex.is\/#\/frazeoloskirjecnikhr\" target=\"_blank\" rel=\"noopener\">Dictionary of Croatian idioms<\/a><\/li>\n<li><a href=\"https:\/\/www.kontekst.io\/hrvatski\" 
target=\"_blank\" rel=\"noopener\">Kontekst.io<\/a> is a lexicon of semantically related words, automatically produced on the basis of word-embeddings from large corpora<\/li>\n<li><a title=\"http:\/\/www.termania.net\" href=\"https:\/\/www.termania.net\/\" target=\"_blank\" rel=\"noopener\">Termania<\/a> is a portal of free online dictionaries, offered by the <a href=\"http:\/\/www.clarin.si\/info\/partners\/#amebis\">Amebis company<\/a><\/li>\n<li><a href=\"https:\/\/hr.wiktionary.org\/wiki\/Glavna_stranica\" target=\"_blank\" rel=\"noopener\">Wje\u010dnik<\/a>, the Croatian Wiktionary, is a multilingual, openly accessible and openly editable dictionary<\/li>\n<li><a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/browse\/croatian-wordnet-v10\/32d93d48703d11e28a985ef2e4e6c59e166ec06132a740cbb36e515b093096b2\/\" target=\"_blank\" rel=\"noopener\">CroWN<\/a> is a lexical database for Croatian and <a href=\"https:\/\/www.ffzg.unizg.hr\/zzl\/racunalni_resursi_e.html\" target=\"_blank\" rel=\"noopener\">CroDeriV<\/a> is a morphological database of Croatian verbs<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1232\" target=\"_blank\" rel=\"noopener\">hrLex<\/a> is the largest inflectional lexicon of Croatian language, consisting of 186,743 lexemes and 6,428,577 entries; it is searchable through the <a href=\"https:\/\/www.clarin.si\/services\/web\/login\" target=\"_blank\" rel=\"noopener\">CLARIN.SI web interface<\/a> (Anonymous login, Lexicon)<\/li>\n<li><a href=\"http:\/\/megahr.ffzg.unizg.hr\/en\/?page_id=609\" target=\"_blank\" rel=\"noopener\">Croatian Psycholinguistic Database<\/a> and <a href=\"http:\/\/polin-hlb.erf.hr\/Rijeci\/Pregled\" target=\"_blank\" rel=\"noopener\">Croatian Lexical Database HLB<\/a> provide psycholinguistic information on Croatian words (i.e., concreteness, imageability, frequency, and age of acquisition)<\/li>\n<li><span style=\"font-weight: 400;\"><a href=\"http:\/\/emocnet.uniri.hr\/congracnet2\/\" target=\"_blank\" 
rel=\"noopener\">Construction Grammar Conceptual Network CongraCNet<\/a> app from th<\/span>e project EmoCNET provides a way of analysing various semantic relations of concepts based on a network structure<\/li>\n<\/ul>\n<\/div>\n<h4 id=\"q12\">Q1.2: How can I\u00a0analyse Croatian corpora online?<\/h4>\n<p>CLARIN.SI offers access to four concordancers, which share the same set of corpora and back-end, but have different front-ends:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Crystal noSketch Engine<\/a>, an open-source variant of the well-known <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>.\u00a0 Instructions for its use are available\u00a0<a href=\"https:\/\/www.sketchengine.co.uk\/user-guide\/\">here<\/a>.\u00a0CLARIN.SI offers two installations of Crystal noSketch Engine: an <a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener\">open installation<\/a> (no log-in, which simplifies use for less advanced users) and a <a href=\"https:\/\/www.clarin.si\/skelog\" target=\"_blank\" rel=\"noopener\">version with log-in<\/a> which allows subcorpus creation and personalised display of e.g. corpus attributes.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Bonito noSketch Engine<\/a> is the old version of noSketch Engine with a radically different user interface from Crystal. 
This version offers some functions that the new noSketch Engine does not, in particular, accessing the results of queries in XML, where it is enough to add the parameter \u201cformat=XML\u201d to the end of the query URL.<\/li>\n<\/ul>\n<p>Documentation on how to query corpora via the SketchEngine-like interfaces is available <a href=\"https:\/\/www.sketchengine.eu\/documentation\/corpus-querying\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>Note that the commercial Sketch Engine also offers access to several <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/croatian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">Croatian language corpora<\/a>, as well as\u00a0some additional tools that are not accessible on the free NoSketch Engine, including the tools to analyse collocations (<a href=\"https:\/\/www.sketchengine.eu\/guide\/word-sketch-collocations-and-word-combinations\/\" target=\"_blank\" rel=\"noopener\">Word sketches<\/a>), synonyms and antonyms (<a href=\"https:\/\/www.sketchengine.eu\/guide\/thesaurus-synonyms-antonyms-similar-words\/\" target=\"_blank\" rel=\"noopener\">Thesaurus<\/a>), the tools to compute frequency lists of multiword expressions (<a href=\"https:\/\/www.sketchengine.eu\/guide\/n-grams-multiword-expressions\/\" target=\"_blank\" rel=\"noopener\">N-grams<\/a>) and to extract <a href=\"https:\/\/www.sketchengine.eu\/guide\/keywords-and-term-extraction\/\" target=\"_blank\" rel=\"noopener\">keywords and terms<\/a>. 
It also allows users to create their own corpora.<\/p>\n<h4>Q1.3: Which Croatian corpora can I analyse\u00a0online?<\/h4>\n<p>For a complete list of corpora available under CLARIN.SI concordancers, see the index for <a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\">Crystal noSkE<\/a>, <a href=\"https:\/\/www.clarin.si\/noske\/index.html\" target=\"_blank\" rel=\"noopener\">Bonito noSkE<\/a> or <a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. Below we list the Croatian ones, with links to the Crystal noSketch Engine concordancer:<\/p>\n<ul>\n<li><em>general language corpora<\/em> are the web corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_hr\" target=\"_blank\" rel=\"noopener\">CLASSLA-web.hr<\/a> (2.7 billion tokens), <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=hrwac\" target=\"_blank\" rel=\"noopener\">hrWaC<\/a> (1.4 billion tokens), the <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=riznica\" target=\"_blank\" rel=\"noopener\">Riznica Croatian Language Corpus<\/a> (100 million tokens) of the <a href=\"http:\/\/www.ihjj.hr\" target=\"_blank\" rel=\"noopener\">Institute for Croatian language and linguistics<\/a>, which consists of literary works and newspaper texts and which you can query via its <a href=\"http:\/\/riznica.ihjj.hr\/index.en.html\" target=\"_blank\" rel=\"noopener\">specialized interface<\/a> as well, and <a href=\"http:\/\/filip.ffzg.hr\/cgi-bin\/run.cgi\/corp_info?corpname=HNK_v30\" target=\"_blank\" rel=\"noopener\">The Croatian National Corpus<\/a> (HNK) of the <a href=\"https:\/\/www.ffzg.hr\/zzl\" target=\"_blank\" rel=\"noopener\">Institute of Linguistics<\/a><\/li>\n<li><em>specialized corpora<\/em> include the parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_hr\" target=\"_blank\" rel=\"noopener\">ParlaMint-HR<\/a> and <a 
href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=yu1parl\" target=\"_blank\" rel=\"noopener\">yu1Parl<\/a>, the parliamentary spoken corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlaspeech_hr\" target=\"_blank\" rel=\"noopener\">ParlaSpeech-HR<\/a>, the Wikipedia corpus for Croatian, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_hr\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-hr<\/a>, and Serbo-Croatian, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_sh\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-sh<\/a>, the corpus of news portals <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=engri2\" target=\"_blank\" rel=\"noopener\">ENGRI<\/a>, and the corpus of tweets <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=tweet_hr\" target=\"_blank\" rel=\"noopener\">Tweet-hr<\/a><\/li>\n<li><em>manually annotated corpora<\/em> include the training corpus of standard language <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=hr500k\" target=\"_blank\" rel=\"noopener\">hr500k<\/a>, the corpora of non-professional written language <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=raput_cln\" target=\"_blank\" rel=\"noopener\">Raput-cln<\/a> (speakers with language disorders) and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=raput_ncln\" target=\"_blank\" rel=\"noopener\">Raput-ncln<\/a> (typical speakers), and the training corpus of computer-mediated communication <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=reldi_hr\" target=\"_blank\" rel=\"noopener\">ReLDI-hr<\/a> with manually normalised (standardised), morphosyntactically tagged and lemmatised words and named entities<\/li>\n<li><em>parallel corpora<\/em> include the multilingual European parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX<\/a>, paired with the 
machine-translated English corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx_en\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX-en<\/a>, and the multilingual DGT translation memory corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=dgtud_hr\" target=\"_blank\" rel=\"noopener\">EU DGT-UD: Croatian<\/a><\/li>\n<\/ul>\n<p>In addition to these,\u00a0the <a href=\"https:\/\/ca.talkbank.org\/access\/Croatian.html\" target=\"_blank\" rel=\"noopener\">Croatian Spoken Language Corpus HrAL<\/a> is available through <a href=\"https:\/\/sla.talkbank.org\/TBB\/ca\/Croatian\" target=\"_blank\" rel=\"noopener\">TalkBank<\/a>. The latter platform also offers access to a small language development corpus of three participants, the <a href=\"https:\/\/childes.talkbank.org\/access\/Slavic\/Croatian\/Kovacevic.html\" target=\"_blank\" rel=\"noopener\">Kova\u010devi\u0107 Corpus<\/a>, the Croatian part of the comparable corpora <a href=\"https:\/\/childes.talkbank.org\/\" target=\"_blank\" rel=\"noopener\">CHILDES<\/a>, which consists of transcripts of child language for 24 languages. 
Furthermore, the <a href=\"http:\/\/teitok.clul.ul.pt\/croltec\/index.php?action=home\" target=\"_blank\" rel=\"noopener\">CroLTeC<\/a> corpus, a learner corpus of Croatian, can be queried via the <a href=\"http:\/\/teitok.clul.ul.pt\/croltec\/index.php?action=cqp\" target=\"_blank\" rel=\"noopener\">TeiTok<\/a> interface.<\/p>\n<p>Furthermore, the commercial <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>\u00a0includes the <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/croatian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">following Croatian corpora<\/a>: <a href=\"https:\/\/www.sketchengine.eu\/eurlex-corpus\/\" target=\"_blank\" rel=\"noopener\">EUR-Lex Croatian<\/a> 2\/2016 and <a href=\"https:\/\/www.sketchengine.eu\/opus-parallel-corpora\/\" target=\"_blank\" rel=\"noopener\">OPUS2 Croatian<\/a>, which is a part of the parallel corpus of 40 languages.<\/p>\n<h4>Q1.4: What linguistic annotation schemas are used in Croatian corpora?<\/h4>\n<p>Most of these corpora are annotated according to the <a href=\"https:\/\/nl.ijs.si\/ME\/\" target=\"_blank\" rel=\"noopener\">MULTEXT-East<\/a> morphosyntactic specifications. The more recent ones use the <a href=\"https:\/\/nl.ijs.si\/ME\/V6\/msd\/html\/msd-hbs.html\" target=\"_blank\" rel=\"noopener\">Version 6 specifications for the Serbo-Croatian macrolanguage<\/a>. More recent corpora also use the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies<\/a> project annotation scheme, in particular that for <a href=\"https:\/\/universaldependencies.org\/hr\/index.html\" target=\"_blank\" rel=\"noopener\">Croatian and Serbian<\/a>. 
Named entities are annotated via the <a href=\"https:\/\/nl.ijs.si\/janes\/wp-content\/uploads\/2017\/09\/SlovenianNER-eng-v1.1.pdf\" target=\"_blank\" rel=\"noopener\">Janes NE guidelines<\/a>.<\/p>\n<h4 id=\"q15\">Q1.5: Where can I download Croatian resources?<\/h4>\n<p>The main point for archiving and downloading Croatian language resources is <a href=\"https:\/\/clarin.si\/repository\/xmlui\/\" target=\"_blank\" rel=\"noopener\">the repository of CLARIN.SI<\/a>.<\/p>\n<p>In addition to the resources mentioned above and below, the repository offers:<\/p>\n<ul>\n<li><em>manually annotated corpora and datasets<\/em>, including\u00a0the <a href=\"http:\/\/hdl.handle.net\/11356\/1342\" target=\"_blank\" rel=\"noopener\">Sentiment Annotated Dataset of Croatian News<\/a>, the multilingual sentiment dataset of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1868\" target=\"_blank\" rel=\"noopener\">ParlaSent<\/a>, the offensive language dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1462\" target=\"_blank\" rel=\"noopener\">FRENK<\/a>, annotated for different types of socially unacceptable discourse, and the commonsense reasoning datasets <a href=\"http:\/\/hdl.handle.net\/11356\/1404\" target=\"_blank\" rel=\"noopener\">COPA-HR<\/a> in Croatian and <a href=\"http:\/\/hdl.handle.net\/11356\/1766\" target=\"_blank\" rel=\"noopener\">DIALECT-COPA<\/a> in the Chakavian dialect<\/li>\n<li><em>parallel corpora<\/em>, including\u00a0the Croatian-English parallel corpora <a href=\"http:\/\/hdl.handle.net\/11356\/1814\" target=\"_blank\" rel=\"noopener\">MaCoCu-hr-en<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1058\" target=\"_blank\" rel=\"noopener\">hrenWaC<\/a> and the <a href=\"http:\/\/hdl.handle.net\/11356\/1049\" target=\"_blank\" rel=\"noopener\">Tourism Corpus<\/a><\/li>\n<li><em>other corpora and datasets<\/em>, including the largest Croatian corpus <b>\u2013<\/b>\u00a0the web corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1806\" 
target=\"_blank\" rel=\"noopener\">MaCoCu-hr<\/a> with 2.4 billion words,\u00a0also available as a genre-enriched version inside the <a href=\"http:\/\/hdl.handle.net\/11356\/1969\" target=\"_blank\" rel=\"noopener\">MaCoCu-Genre<\/a> corpus collection, the linguistically annotated corpus of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1911\" target=\"_blank\" rel=\"noopener\">ParlaMint.ana<\/a>,\u00a0the automatic speech recognition training dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1914\" target=\"_blank\" rel=\"noopener\">ParlaSpeech-HR<\/a>, the 24sata <a href=\"http:\/\/hdl.handle.net\/11356\/1410\" target=\"_blank\" rel=\"noopener\">news article archive<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1399\" target=\"_blank\" rel=\"noopener\">news comment dataset<\/a>,\u00a0the multilingual IPTC news media topic dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1991\" target=\"_blank\" rel=\"noopener\">EMMediaTopic<\/a>,\u00a0the <a href=\"http:\/\/hdl.handle.net\/11356\/1054\" target=\"_blank\" rel=\"noopener\">Twitter corpus<\/a>,\u00a0the text collection for training the <a href=\"https:\/\/huggingface.co\/classla\/bcms-bertic\" target=\"_blank\" rel=\"noopener\">BERTi\u0107<\/a> transformer model <a href=\"http:\/\/hdl.handle.net\/11356\/1426\" target=\"_blank\" rel=\"noopener\">BERTi\u0107-data<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1403\" target=\"_blank\" rel=\"noopener\">Keyword extraction dataset<\/a>, the news dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1461\" target=\"_blank\" rel=\"noopener\">SETimes.HBS<\/a>\u00a0and the Twitter dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1482\" target=\"_blank\" rel=\"noopener\">Twitter-HBS<\/a> for discriminating between Bosnian, Croatian, Montenegrin and Serbian, and the <a href=\"http:\/\/hdl.handle.net\/11356\/1765\" target=\"_blank\" rel=\"noopener\">Mi\u0107i Princ &#8220;text and speech&#8221; dialectal dataset<\/a> in various Chakavian 
micro-dialects<\/li>\n<li><em>wordlists and other lexical resources,\u00a0<\/em>including\u00a0the automatically constructed multiword lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1177\" target=\"_blank\" rel=\"noopener\">hrMWELex<\/a>, the verbal databases of the Western South Slavic <a href=\"http:\/\/hdl.handle.net\/11356\/1683\" target=\"_blank\" rel=\"noopener\">HyperVerb<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1846\" target=\"_blank\" rel=\"noopener\">WeSoSlaV<\/a>, and the <a href=\"http:\/\/hdl.handle.net\/11356\/1318\" target=\"_blank\" rel=\"noopener\">LiLaH<\/a> emotion lexicon<\/li>\n<\/ul>\n<p>Another point where you can find Croatian resources is the <a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/search\/?q=&amp;selected_facets=languageNameFilter_exact%3ACroatian\" target=\"_blank\" rel=\"noopener\">MetaShare repository<\/a>, which includes the sentiment lexicon <a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/browse\/croatian-sentiment-lexicon\/940fe19e6c6d11e28a985ef2e4e6c59eff8b12d75f284d58aacfa8d732467509\/\" target=\"_blank\" rel=\"noopener\">CroSentilex<\/a>, the valency lexicon <a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/browse\/croatian-valency-lexicon\/1ebc2bf4703d11e28a985ef2e4e6c59e8cead57cc3314c4bb12a87eb058428bd\/\" target=\"_blank\" rel=\"noopener\">CROVALLEX<\/a>, and the South-East European Parallel Corpus <a href=\"http:\/\/metashare.ilsp.gr:8080\/repository\/browse\/south-east-european-parallel-corpus\/d200935e67cc11e28a985ef2e4e6c59ef6e70e681f7745a191deeb0b0537e60a\/\" target=\"_blank\" rel=\"noopener\">SETimes Corpus<\/a>.<\/p>\n<p>In addition to this, some Croatian language resources can be downloaded from the <a href=\"https:\/\/repository.pfri.uniri.hr\/en\" target=\"_blank\" rel=\"noopener\">Repository of the Faculty of Maritime Studies of University of Rijeka (FMSRI)<\/a>, such as the <a href=\"https:\/\/repository.pfri.uniri.hr\/islandora\/object\/pfri:2518\" target=\"_blank\" 
rel=\"noopener\">Database of English words and their Croatian equivalents<\/a>, the <a href=\"https:\/\/repository.pfri.uniri.hr\/islandora\/object\/pfri:2495\" target=\"_blank\" rel=\"noopener\">Database of English words in Croatian<\/a>, and the <a href=\"https:\/\/repository.pfri.uniri.hr\/islandora\/object\/pfri:2614\" target=\"_blank\" rel=\"noopener\">CROWD-5e<\/a> database, a Croatian psycholinguistic database of affective norms for five discrete emotions.<\/p>\n<p>Moreover, scientific publications in the Croatian language are available as part of the scientific corpus <a href=\"https:\/\/huggingface.co\/datasets\/procesaur\/ZNANJE\" target=\"_blank\" rel=\"noopener\">ZNANJE<\/a> on the Hugging Face repository. In addition to Slovenian and Serbian publications, a large part of the ZNANJE corpus comprises Croatian scientific publications that were collected from the <a href=\"https:\/\/dabar.srce.hr\/en\/dabar\" target=\"_blank\" rel=\"noopener\">Croatian Digital Academic Archives and Repositories (DABAR)<\/a> service.<\/p>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"processing\">2. Tools to annotate Croatian texts<\/h2>\n<h4 id=\"q21\">Q2.1: How can I perform basic linguistic processing of my Croatian texts?<\/h4>\n<p>The state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> processes standard and non-standard (Internet) Croatian on the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Croatian, the CLASSLA-Stanza pipeline uses the rule-based <a href=\"https:\/\/github.com\/clarinsi\/reldi-tokeniser\" target=\"_blank\" rel=\"noopener\">reldi-tokeniser<\/a>. 
Off-the-shelf models are also available for lemmatisation of <a href=\"http:\/\/hdl.handle.net\/11356\/1829\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1827\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Croatian, and for part-of-speech tagging of <a href=\"http:\/\/hdl.handle.net\/11356\/1832\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1826\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Croatian. You can try out the pipeline at the <a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\">CLASSLA Annotation tool<\/a> website.<\/p>\n<p>The documentation for the installation and use of the pipeline is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>In addition to this, tokenisation, part-of-speech tagging, and lemmatisation are also provided by the CLARIN.SI service <a href=\"https:\/\/clarin.si\/services\/web\/\" target=\"_blank\" rel=\"noopener\">ReLDIanno<\/a>. This is a legacy system for linguistic annotation that we still keep available for backward compatibility, but we recommend that new users use the above-mentioned CLASSLA-Stanza pipeline.<\/p>\n<h4 id=\"q22\">Q2.2: How can I standardize my texts prior to further processing?<\/h4>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, mentioned above, also includes models for processing non-standard text, which allows non-standard texts to be annotated without prior standardization.<\/p>\n<p>Currently, the only on-line text normalisation tool available through the CLARIN.SI services (ReLDIanno) is the <a href=\"https:\/\/www.clarin.si\/services\/web\/login\" target=\"_blank\" rel=\"noopener\">REDI diacritic restorer<\/a>. 
Its usage is documented <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">here<\/a>. You can also <a href=\"https:\/\/github.com\/clarinsi\/redi\" target=\"_blank\" rel=\"noopener\">download<\/a> it, install it and use it locally.<\/p>\n<p>For word-level normalisation of user-generated Croatian texts, you can download and install the <a href=\"https:\/\/github.com\/clarinsi\/csmtiser\" target=\"_blank\" rel=\"noopener\">CSMTiser text normaliser<\/a>.<\/p>\n<h4 id=\"q23\">Q2.3: How can I annotate my texts for named entities?<\/h4>\n<p>Named entity recognition is provided by the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers off-the-shelf models for <a href=\"http:\/\/hdl.handle.net\/11356\/1322\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1340\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Croatian. In addition to this, on-line NER is available via the CLARIN.SI service <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">ReLDIanno<\/a>. 
You can also download the <a href=\"https:\/\/github.com\/clarinsi\/janes-ner\" target=\"_blank\" rel=\"noopener\">janes-ner<\/a> tool.<\/p>\n<h4 id=\"q24\">Q2.4: How can I syntactically parse my texts?<\/h4>\n<p>You can syntactically parse Croatian texts, following the <a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>, in multiple ways:<\/p>\n<ul>\n<li>by using the state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers an <a href=\"http:\/\/hdl.handle.net\/11356\/1836\" target=\"_blank\" rel=\"noopener\">off-the-shelf model<\/a><\/li>\n<li>by using the CLARIN.SI service <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">ReLDIanno<\/a><\/li>\n<li>by using the <a href=\"https:\/\/ufal.mff.cuni.cz\/udpipe\" target=\"_blank\" rel=\"noopener\">UDPipe tool<\/a>, which has off-the-shelf models for many languages, Croatian included<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"training\">3. Datasets to train Croatian annotation tools<\/h2>\n<h4 id=\"q31\">Q3.1: Where can I get word embeddings or pre-trained language models for Croatian?<\/h4>\n<ul>\n<li>The embeddings trained on the largest collection of Croatian textual data (hrWaC, Riznica, 24sata newspaper texts and comments, MaCoCu-hr, etc.) 
are available in the <a href=\"http:\/\/hdl.handle.net\/11356\/1790\" target=\"_blank\" rel=\"noopener\">CLARIN.SI-embed.hr<\/a> embedding collection.<\/li>\n<li>There are also collections of trained embeddings for Croatian available from\u00a0<a href=\"https:\/\/fasttext.cc\/docs\/en\/crawl-vectors.html\" target=\"_blank\" rel=\"noopener\">fastText<\/a>.<\/li>\n<li>If you want to train your own embeddings, the largest freely available collection of Croatian texts is the <a href=\"http:\/\/hdl.handle.net\/11356\/1426\" target=\"_blank\" rel=\"noopener\">BERTi\u0107-data<\/a> text collection.<\/li>\n<\/ul>\n<p>You can also use the transformer language model\u00a0<a href=\"https:\/\/huggingface.co\/classla\/bcms-bertic\" target=\"_blank\" rel=\"noopener\">BERTi\u0107<\/a>, a state-of-the-art model that represents words\/tokens as contextually dependent word embeddings. It allows you to extract word embeddings for every word occurrence, which can then be used in training a model for an end task.<\/p>\n<h4 id=\"q32\">Q3.2: What data is available for training a text normaliser for Croatian?<\/h4>\n<p>For training text normalisers for Internet Croatian, you can use the <a href=\"http:\/\/hdl.handle.net\/11356\/1793\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-hr<\/a> dataset, a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian.<\/p>\n<h4 id=\"q33\">Q3.3: What data is available for training a part-of-speech tagger for Croatian?<\/h4>\n<p>The reference dataset for training a standard tagger is <a href=\"http:\/\/hdl.handle.net\/11356\/1792\" target=\"_blank\" rel=\"noopener\">hr500k<\/a>.\u00a0There is also the <a href=\"http:\/\/hdl.handle.net\/11356\/1793\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-hr<\/a>\u00a0training dataset of Internet Croatian.<\/p>\n<p>You can also use the <a 
href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> in combination with the <a href=\"http:\/\/hdl.handle.net\/11356\/1790\" target=\"_blank\" rel=\"noopener\">CLARIN.SI embeddings<\/a> and the training dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1792\" target=\"_blank\" rel=\"noopener\">hr500k<\/a> to train and evaluate your own part-of-speech tagger. The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#part-of-speech-tagging-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q34\">Q3.4: What data is available for training a lemmatiser for Croatian?<\/h4>\n<p>Lemmatisers can be trained on the tagger training data (<a href=\"http:\/\/hdl.handle.net\/11356\/1792\" target=\"_blank\" rel=\"noopener\">hr500k<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1793\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-hr<\/a>; see the section on PoS tagger training for details), on the inflectional lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1232\" target=\"_blank\" rel=\"noopener\">hrLex<\/a>, or on both.<\/p>\n<p>For training your own lemmatiser for standard and non-standard Croatian, you can use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which uses an external lexicon for lemmatisation (<a href=\"http:\/\/hdl.handle.net\/11356\/1232\" target=\"_blank\" rel=\"noopener\">hrLex<\/a>). The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#lemmatization\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q35\">Q3.5: What data is available for training a named entity recogniser for Croatian?<\/h4>\n<p>For training a named entity recogniser of standard language, <a href=\"http:\/\/hdl.handle.net\/11356\/1792\" target=\"_blank\" rel=\"noopener\">hr500k<\/a> is the best resource. 
For training NER systems for online, non-standard texts, <a href=\"http:\/\/hdl.handle.net\/11356\/1793\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-hr<\/a> can be used.<\/p>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> allows you to train your own named entity recogniser as well. The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#ner-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q36\">Q3.6: What data is available for training a syntactic parser for Croatian?<\/h4>\n<p>If you want to follow the <a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a> for dependency parsing, the best location for obtaining training data is the <a href=\"https:\/\/github.com\/UniversalDependencies\/UD_Croatian-SET\" target=\"_blank\" rel=\"noopener\">Universal Dependencies repository<\/a>.<\/p>\n<p>If you require additional annotation layers, e.g., for multi-task learning, the <a href=\"http:\/\/hdl.handle.net\/11356\/1792\" target=\"_blank\" rel=\"noopener\">hr500k<\/a> dataset should be used.<\/p>\n<p>You can also use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> to train your own parser. The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#parsing-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n<div id=\"themify_builder_content-3566\" data-postid=\"3566\" class=\"themify_builder_content themify_builder_content-3566 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. 
If you notice any missing or wrong information, please do let us know on helpdesk.classla@clarin.si, Subject &#8220;FAQ_Croatian&#8221;. The questions in this FAQ are organised into three main sections: 1. Online Croatian language resources Q1.1: Where can I find [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-3566","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3566","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=3566"}],"version-history":[{"count":124,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3566\/revisions"}],"predecessor-version":[{"id":7898,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3566\/revisions\/7898"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=3566"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}