{"id":3568,"date":"2019-03-20T13:36:25","date_gmt":"2019-03-20T13:36:25","guid":{"rendered":"http:\/\/www.clarin.si\/info\/?page_id=3568"},"modified":"2025-01-08T08:01:43","modified_gmt":"2025-01-08T08:01:43","slug":"faq4serbian","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/faq4serbian\/","title":{"rendered":"FAQ for Serbian language resources and technologies"},"content":{"rendered":"<p>This FAQ is part of the documentation of the <a href=\"..\/\">CLASSLA<\/a> CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please do let us know on <a href=\"mailto:helpdesk.classla@clarin.si?subject=FAQ_Serbian\">helpdesk.classla@clarin.si<\/a>, Subject &#8220;FAQ_Serbian&#8221;.<\/p>\n<p>The questions in this FAQ are organised into three main sections:<\/p>\n\n<h2 id=\"existing\">1. Online Serbian language resources<\/h2>\n<h4 id=\"q11\">Q1.1: Where can I find Serbian dictionaries?<\/h4>\n<p>Below we list the main lexical resources:<\/p>\n<ul>\n<li id=\"tab-tab-2860-0-0-6-2860-0\" class=\"tab-content\"><a href=\"http:\/\/www.srpskijezik.com\/Home\/Index\" target=\"_blank\" rel=\"noopener\">Srpski jezik<\/a> dictionary portal includes general dictionaries, orthographic and terminological dictionaries, including <a href=\"https:\/\/www.isj.sanu.ac.rs\/gb\/projekti\/recnik_sanu\/\" target=\"_blank\" rel=\"noopener\">SASA<\/a>, the official dictionary of the Serbo-Croatian language published by the Serbian Academy of Sciences and Arts. They can be accessed upon registration.<\/li>\n<li class=\"tab-content\"><a href=\"http:\/\/staznaci.com\/\" target=\"_blank\" rel=\"noopener\">\u0161taZna\u010di.com<\/a> is an online dictionary which is freely accessible<\/li>\n<li><a href=\"https:\/\/raskovnik.org\" target=\"_blank\" rel=\"noopener\">Raskovnik<\/a> is a dictionary portal aimed at providing access to historical dictionaries of Serbian language. 
Registered users can access five dictionaries with around 84 thousand entries altogether.<\/li>\n<li><a href=\"https:\/\/dais.sanu.ac.rs\/\" target=\"_blank\" rel=\"noopener\">Digital Archive of the Serbian Academy of Sciences and Arts<\/a> (DAIS) provides open access to some digitised specialised dictionaries<\/li>\n<li><a href=\"https:\/\/www.kontekst.io\/srpski\" target=\"_blank\" rel=\"noopener\">Kontekst.io<\/a> is a lexicon of semantically related words, automatically produced on the basis of word-embeddings from large corpora<\/li>\n<li><a href=\"https:\/\/sr.wiktionary.org\/wiki\/\" target=\"_blank\" rel=\"noopener\">Vikire\u010dnik<\/a>, the Serbian Wiktionary, is a multilingual, openly accessible, and openly editable dictionary<\/li>\n<li><a href=\"http:\/\/www.korpus.matf.bg.ac.rs\/SrpMD\/\" target=\"_blank\" rel=\"noopener\">Morphological electronic dictionary of Serbian<\/a> (Ekavian pronunciation) and the Serbian WordNet <a href=\"http:\/\/korpus.matf.bg.ac.rs\/SrpWN\/\" target=\"_blank\" rel=\"noopener\">SrpWN<\/a> can be requested via the MetaShare repository<\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1233\" target=\"_blank\" rel=\"noopener\">srLex<\/a> is the largest inflectional lexicon of Serbian language, consisting of 192,590 lexemes and 6,908,043 entries; it is searchable through the <a href=\"https:\/\/www.clarin.si\/services\/web\/login\" target=\"_blank\" rel=\"noopener\">CLARIN.SI web interface<\/a> (Anonymous login, Lexicon)<\/li>\n<\/ul>\n<h4 id=\"q12\">Q1.2: How can I\u00a0analyse Serbian corpora online?<\/h4>\n<p>CLARIN.SI offers access to four concordancers, which share the same set of corpora and back-end, but have different front-ends:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Crystal noSketch Engine<\/a>, an open-source variant of the well-known <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>.\u00a0 
Instructions for its use are available\u00a0<a href=\"https:\/\/www.sketchengine.co.uk\/user-guide\/\">here<\/a>. CLARIN.SI offers two installations of Crystal noSketch Engine: an <a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener\">open installation<\/a> (no log-in, which simplifies use for less advanced users) and a <a href=\"https:\/\/www.clarin.si\/skelog\" target=\"_blank\" rel=\"noopener\">version with log-in<\/a> which allows subcorpus creation and personalised display of e.g. corpus attributes.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Bonito noSketch Engine<\/a> is the old version of noSketch Engine with a radically different user interface from Crystal. 
This version offers some functions that the new noSketch Engine does not, in particular, accessing the results of queries in XML, where it is enough to add the parameter \u201cformat=XML\u201d to the end of the query URL.<\/li>\n<\/ul>\n<p>Documentation on how to query corpora via the SketchEngine-like interfaces is available <a href=\"https:\/\/www.sketchengine.eu\/documentation\/corpus-querying\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>Note that the commercial <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/serbian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a> also offers access to several Serbian language corpora, as well as\u00a0some additional tools that are not accessible on the free NoSketch Engine, including the tools to analyse collocations (<a href=\"https:\/\/www.sketchengine.eu\/guide\/word-sketch-collocations-and-word-combinations\/\" target=\"_blank\" rel=\"noopener\">Word sketches<\/a>), synonyms and antonyms (<a href=\"https:\/\/www.sketchengine.eu\/guide\/thesaurus-synonyms-antonyms-similar-words\/\" target=\"_blank\" rel=\"noopener\">Thesaurus<\/a>), and the tools to compute frequency lists of multiword expressions (<a href=\"https:\/\/www.sketchengine.eu\/guide\/n-grams-multiword-expressions\/\" target=\"_blank\" rel=\"noopener\">N-grams<\/a>). It also allows users to create their own corpora.<\/p>\n<h4>Q1.3: Which Serbian corpora can I analyse\u00a0online?<\/h4>\n<p>For a complete list of corpora available under CLARIN.SI concordancers, see the index for <a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\">Crystal noSkE<\/a>, <a href=\"https:\/\/www.clarin.si\/noske\/index.html\" target=\"_blank\" rel=\"noopener\">Bonito noSkE<\/a> or <a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. 
Below we list some of the important ones, with links to the Crystal noSketch Engine concordancer:<\/p>\n<ul>\n<li><em>general language corpora<\/em> that are available under CLARIN.si concordancers are the web corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sr\" target=\"_blank\" rel=\"noopener\">CLASSLA-web.sr<\/a> (2.9 billion tokens), <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=pdrs10\" target=\"_blank\" rel=\"noopener\">PDRS<\/a> (715 million tokens) and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=srwac\" target=\"_blank\" rel=\"noopener\">srWaC<\/a> (554 million tokens)<\/li>\n<li><em>specialized corpora<\/em> include\u00a0the parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_rs\" target=\"_blank\" rel=\"noopener\">ParlaMint-RS<\/a> and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=yu1parl\" target=\"_blank\" rel=\"noopener\">yu1Parl<\/a>, the parliamentary spoken corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlaspeech_rs\" target=\"_blank\" rel=\"noopener\">ParlaSpeech-RS<\/a>, the Wikipedia corpus for Serbian, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_sr\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-sr<\/a>, and Serbo-Croatian, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_sh\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-sh<\/a>, the corpus of tweets <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=tweet_sr\" target=\"_blank\" rel=\"noopener\">Tweet-sr<\/a>, the <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=sfa_sr\" target=\"_blank\" rel=\"noopener\">Corpus of Serbian forms of address<\/a>, and the <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=torlak\" target=\"_blank\" rel=\"noopener\">Corpus of Torlak dialect transcriptions<\/a><\/li>\n<li><em>manually annotated corpora<\/em> include the training corpus of standard 
language <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=setimes_sr\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a> and the training corpus of computer-mediated communication <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=reldi_sr\" target=\"_blank\" rel=\"noopener\">ReLDI-sr<\/a> with manually normalised (standardised), morphosyntactically tagged and lemmatised words and named entities<\/li>\n<li><em>parallel corpora<\/em> include the multilingual European parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX<\/a>, paired with the machine-translated English corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx_en\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX-en<\/a><\/li>\n<\/ul>\n<p>In addition to these, the general Corpus of Contemporary Serbian SrpKor2013 (122 million tokens) of the <a href=\"http:\/\/www.matf.bg.ac.rs\/eng\/\" target=\"_blank\" rel=\"noopener\">Faculty of Mathematics, University of Belgrade<\/a>, which consists of literary, scientific, administrative, and general texts, is available for search through the <a href=\"https:\/\/cwb.sourceforge.io\/\" target=\"_blank\" rel=\"noopener\">IMS corpus workbench<\/a>, but it requires authentication. 
Another specialised corpus, the <a href=\"https:\/\/childes.talkbank.org\/access\/Slavic\/Serbian\/SCECL.html\" target=\"_blank\" rel=\"noopener\">Serbian Corpus of Early Child Language<\/a> (SCECL), can be downloaded and browsed via <a href=\"https:\/\/sla.talkbank.org\/TBB\/childes\/Slavic\/Serbian\/SCECL\" target=\"_blank\" rel=\"noopener\">TalkBank<\/a>.<\/p>\n<p>Furthermore, the commercial <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>\u00a0includes some Serbian corpora in <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/serbian-latin--text-corpora\/\" target=\"_blank\" rel=\"noopener\">Latin<\/a> and <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/serbian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">Cyrillic<\/a> script.<\/p>\n<h4>Q1.4: What linguistic annotation schemas are used in Serbian corpora?<\/h4>\n<p>Most of these corpora are annotated according to the <a href=\"https:\/\/nl.ijs.si\/ME\/\" target=\"_blank\" rel=\"noopener\">MULTEXT-East<\/a> morphosyntactic specifications. The more recent ones use the <a href=\"https:\/\/nl.ijs.si\/ME\/V6\/msd\/html\/msd-hbs.html\" target=\"_blank\" rel=\"noopener\">Version 6 specifications for the Serbo-Croatian macrolanguage<\/a>. More recent corpora also use the <a href=\"https:\/\/universaldependencies.org\/\" target=\"_blank\" rel=\"noopener\">Universal Dependencies<\/a> project annotation scheme, in particular that for <a href=\"https:\/\/universaldependencies.org\/sr\/index.html\" target=\"_blank\" rel=\"noopener\">Croatian and Serbian<\/a>. 
Named entities are annotated via the <a href=\"https:\/\/nl.ijs.si\/janes\/wp-content\/uploads\/2017\/09\/SlovenianNER-eng-v1.1.pdf\" target=\"_blank\" rel=\"noopener\">Janes NE guidelines<\/a>.<\/p>\n<h4 id=\"q15\">Q1.5: Where can I download Serbian resources?<\/h4>\n<p>The main point for archiving and downloading Serbian language resources is <a href=\"https:\/\/clarin.si\/repository\/xmlui\/\" target=\"_blank\" rel=\"noopener\">the repository of CLARIN.SI<\/a>.<\/p>\n<p>In addition to the resources mentioned above and below, the repository offers:<\/p>\n<ul>\n<li><em>manually annotated corpora and datasets<\/em>, including the sentiment-annotated <a href=\"http:\/\/hdl.handle.net\/11356\/1054\" target=\"_blank\" rel=\"noopener\">Twitter corpora<\/a>, the multilingual sentiment dataset of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1868\" target=\"_blank\" rel=\"noopener\">ParlaSent<\/a>,\u00a0the post-edited and error annotated machine translation corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1065\" target=\"_blank\" rel=\"noopener\">PErr<\/a>, and the choice of plausible alternatives datasets <a href=\"http:\/\/hdl.handle.net\/11356\/1708\" target=\"_blank\" rel=\"noopener\">COPA-SR<\/a> in Serbian and <a href=\"http:\/\/hdl.handle.net\/11356\/1766\" target=\"_blank\" rel=\"noopener\">DIALECT-COPA<\/a> in the Torlak dialect from southeastern Serbia;<\/li>\n<li><em>parallel corpora, <\/em>including the Serbian-English parallel corpora\u00a0<a href=\"http:\/\/hdl.handle.net\/11356\/1819\" target=\"_blank\" rel=\"noopener\">MaCoCu-sr-en<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1059\" target=\"_blank\" rel=\"noopener\">srenWaC<\/a>;<\/li>\n<li><em>wordlists and other lexical resources,\u00a0<\/em>including the automatically constructed multiword lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1178\" target=\"_blank\" rel=\"noopener\">srMWELex<\/a>, and the databases of the Western South Slavic verb <a 
href=\"http:\/\/hdl.handle.net\/11356\/1683\" target=\"_blank\" rel=\"noopener\">HyperVerb<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1846\" target=\"_blank\" rel=\"noopener\">WeSoSlaV<\/a>;<\/li>\n<li><em>other corpora and datasets,\u00a0<\/em>including the largest Serbian corpus <b>\u2013<\/b>\u00a0the web corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1807\" target=\"_blank\" rel=\"noopener\">MaCoCu-sr<\/a> with 2.5 billion words,\u00a0also available as a genre-enriched version inside the <a href=\"http:\/\/hdl.handle.net\/11356\/1969\" target=\"_blank\" rel=\"noopener\">MaCoCu-Genre<\/a> corpus collection, the linguistically annotated corpus of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1911\" target=\"_blank\" rel=\"noopener\">ParlaMint.ana<\/a>, the text collection for training the <a href=\"https:\/\/huggingface.co\/classla\/bcms-bertic\" target=\"_blank\" rel=\"noopener\">BERTi\u0107<\/a> transformer model <a href=\"http:\/\/hdl.handle.net\/11356\/1426\" target=\"_blank\" rel=\"noopener\">BERTi\u0107-data<\/a>, the corpus of <a href=\"http:\/\/hdl.handle.net\/11356\/1754\" target=\"_blank\" rel=\"noopener\">Legislation texts of Republic of Serbia<\/a>, the automatic speech recognition training dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1679\" target=\"_blank\" rel=\"noopener\">JuzneVesti-SR<\/a>, the multilingual news sentiment analysis dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1987\" target=\"_blank\" rel=\"noopener\">SADEmma<\/a>, the news dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1461\" target=\"_blank\" rel=\"noopener\">SETimes.HBS<\/a>\u00a0and the Twitter dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1482\" target=\"_blank\" rel=\"noopener\">Twitter-HBS<\/a> for discriminating between Bosnian, Croatian, Montenegrin and Serbian.<\/li>\n<\/ul>\n<p>Another point where you can find Serbian resources is the <a href=\"https:\/\/live.european-language-grid.eu\/\" target=\"_blank\" 
rel=\"noopener\">European Language Grid<\/a> platform. It includes the public mining corpus <a href=\"https:\/\/doi.org\/10.57771\/gnjv-x420\" target=\"_blank\" rel=\"noopener\">RudKorP<\/a>,\u00a0the lexicon of sentiment-scored words <a href=\"https:\/\/doi.org\/10.57771\/5ey8-st50\" target=\"_blank\" rel=\"noopener\">SentiWords.SR<\/a>, and the morphological dictionaries <a href=\"https:\/\/doi.org\/10.57771\/j0ge-8e29\" target=\"_blank\" rel=\"noopener\">SrpMD<\/a>.<\/p>\n<p>Additional Serbian resources are available at the <a href=\"https:\/\/huggingface.co\/jerteh\" target=\"_blank\" rel=\"noopener\">Hugging Face profile<\/a> of the <a href=\"https:\/\/jerteh.rs\/index.php\/en\/\" target=\"_blank\" rel=\"noopener\">Serbian Language Resources and Technologies Society<\/a> (JERTEH), including the corpus of old Serbian novels <a href=\"https:\/\/huggingface.co\/datasets\/jerteh\/SrpELTeC\" target=\"_blank\" rel=\"noopener\">SrpELTeC<\/a>, the news corpus <a href=\"https:\/\/huggingface.co\/datasets\/jerteh\/SrpKorNews\" target=\"_blank\" rel=\"noopener\">SrpKorNews<\/a>, the parallel Serbian-English corpus of doctoral dissertation abstracts <a href=\"https:\/\/huggingface.co\/datasets\/jerteh\/PaSaz\" target=\"_blank\" rel=\"noopener\">PaSa\u017e<\/a>, and the named entity recognition training corpus <a href=\"https:\/\/huggingface.co\/datasets\/jerteh\/SrpELTeC-gold-NER\" target=\"_blank\" rel=\"noopener\">SrpELTeC-gold-NER<\/a>. Another dataset, available on the Hugging Face repository, is the Serbian scientific corpus <a href=\"https:\/\/huggingface.co\/datasets\/procesaur\/STARS\" target=\"_blank\" rel=\"noopener\">STARS<\/a>.<\/p>\n<p>Furthermore, a <a href=\"https:\/\/reldi.spur.uzh.ch\/resources-and-tools\/\" target=\"_blank\" rel=\"noopener\">list<\/a> of some additional resources has been compiled by the <a href=\"https:\/\/reldi.spur.uzh.ch\/\" target=\"_blank\" rel=\"noopener\">Regional Linguistic Data Initiative ReLDI<\/a>. 
It includes the paraphrase corpus <a href=\"https:\/\/vukbatanovic.github.io\/paraphrase.sr\/\" target=\"_blank\" rel=\"noopener\">paraphrase.sr<\/a>, the movie review dataset <a href=\"https:\/\/vukbatanovic.github.io\/SerbMR\/\" target=\"_blank\" rel=\"noopener\">SerbMR<\/a>, constructed for the task of sentiment analysis, and the sentiment analysis dataset of comments <a href=\"https:\/\/vukbatanovic.github.io\/SentiComments.SR\/\" target=\"_blank\" rel=\"noopener\">SentiComments.SR<\/a>.<\/p>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"processing\">2. Tools to annotate Serbian texts<\/h2>\n<h4 id=\"q21\">Q2.1: How can I perform basic linguistic processing of my Serbian texts?<\/h4>\n<p>The state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> processes standard and non-standard (Internet) Serbian at the levels of tokenisation and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing, and named entity recognition. For Serbian, the CLASSLA-Stanza pipeline uses the rule-based <a href=\"https:\/\/github.com\/clarinsi\/reldi-tokeniser\" target=\"_blank\" rel=\"noopener\">reldi-tokeniser<\/a>. Off-the-shelf models are also available for lemmatisation of <a href=\"http:\/\/hdl.handle.net\/11356\/1830\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1828\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Serbian, and for part-of-speech tagging of <a href=\"http:\/\/hdl.handle.net\/11356\/1831\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1825\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Serbian. 
You can try out the pipeline at the <a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\">CLASSLA Annotation tool<\/a> website.<\/p>\n<p>The documentation for the installation and use of the pipeline is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>Tokenisation, part-of-speech tagging, and lemmatisation are also provided by the CLARIN.SI service <a href=\"https:\/\/clarin.si\/services\/web\/\" target=\"_blank\" rel=\"noopener\">ReLDIanno<\/a>. This is a legacy system for linguistic annotation that we still keep available for backward compatibility, but we recommend that new users use the above-mentioned CLASSLA-Stanza pipeline.<\/p>\n<h4 id=\"q22\">Q2.2: How can I standardise my texts prior to further processing?<\/h4>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, mentioned above, also includes models for processing non-standard text, which allows such texts to be annotated without prior standardisation.<\/p>\n<p>Currently, the only on-line text normalisation tool available through the CLARIN.SI services (ReLDIanno) is the <a href=\"https:\/\/www.clarin.si\/services\/web\/login\" target=\"_blank\" rel=\"noopener\">REDI diacritic restorer<\/a>. Its usage is documented <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">here<\/a>. 
You can also <a href=\"https:\/\/github.com\/clarinsi\/redi\" target=\"_blank\" rel=\"noopener\">download<\/a> it, install it and use it locally.<\/p>\n<p>For word-level normalisation of user-generated Serbian texts, you can download and install the <a href=\"https:\/\/github.com\/clarinsi\/csmtiser\" target=\"_blank\" rel=\"noopener\">CSMTiser text normaliser<\/a>.<\/p>\n<h4 id=\"q23\">Q2.3: How can I annotate my texts for named entities?<\/h4>\n<p>Named entity recognition is provided by the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers off-the-shelf models for <a href=\"http:\/\/hdl.handle.net\/11356\/1323\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1341\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Serbian.<\/p>\n<p>In addition to this, on-line NER is available via the CLARIN.SI service <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">ReLDIanno<\/a>. 
You can also download the <a href=\"https:\/\/github.com\/clarinsi\/janes-ner\" target=\"_blank\" rel=\"noopener\">janes-ner<\/a> tool.<\/p>\n<h4 id=\"q24\">Q2.4: How can I syntactically parse my texts?<\/h4>\n<p>You can syntactically parse Serbian texts, following the <a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>, in multiple ways:<\/p>\n<ul>\n<li>by using the state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers an <a href=\"http:\/\/hdl.handle.net\/11356\/1835\" target=\"_blank\" rel=\"noopener\">off-the-shelf model<\/a><\/li>\n<li>by using the CLARIN.SI service <a href=\"https:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\" target=\"_blank\" rel=\"noopener\">ReLDIanno<\/a><\/li>\n<li>by using the <a href=\"https:\/\/ufal.mff.cuni.cz\/udpipe\" target=\"_blank\" rel=\"noopener\">UDPipe tool<\/a>, which has off-the-shelf models for many languages, Serbian included<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"training\">3. 
Datasets to train Serbian annotation tools<\/h2>\n<h4 id=\"q31\">Q3.1: Where can I get word embeddings or pre-trained language models for Serbian?<\/h4>\n<ul>\n<li>Embeddings trained on the srWaC and MaCoCu-sr web corpora are available as the <a href=\"http:\/\/hdl.handle.net\/11356\/1789\" target=\"_blank\" rel=\"noopener\">CLARIN.SI-embed.sr<\/a> embedding collection.<\/li>\n<li>There are also collections of trained embeddings for Serbian available from\u00a0<a href=\"https:\/\/fasttext.cc\/docs\/en\/crawl-vectors.html\" target=\"_blank\" rel=\"noopener\">fastText<\/a>\u00a0(Latin and Cyrillic scripts are mixed and no transliteration was performed).<\/li>\n<li>If you want to train your own embeddings, the largest freely available collection of Serbian texts is the <a href=\"http:\/\/hdl.handle.net\/11356\/1426\" target=\"_blank\" rel=\"noopener\">BERTi\u0107-data<\/a> text collection.<\/li>\n<\/ul>\n<p>You can also use the transformer language model\u00a0<a href=\"https:\/\/huggingface.co\/classla\/bcms-bertic\" target=\"_blank\" rel=\"noopener\">BERTi\u0107<\/a>, a state-of-the-art model that represents words\/tokens as contextually dependent word embeddings. 
It allows you to extract word embeddings for every word occurrence, which can then be used in training a model for an end task.<\/p>\n<h4 id=\"q32\">Q3.2: What data is available for training a text normaliser for Serbian?<\/h4>\n<p>For training text normalisers for Internet Serbian, the <a href=\"http:\/\/hdl.handle.net\/11356\/1794\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-sr<\/a> dataset can be used,\u00a0a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Serbian.<\/p>\n<h4 id=\"q33\">Q3.3: What data is available for training a part-of-speech tagger for Serbian?<\/h4>\n<p>The reference dataset for training a standard tagger is <a href=\"http:\/\/hdl.handle.net\/11356\/1843\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a>.\u00a0There is also the <a href=\"http:\/\/hdl.handle.net\/11356\/1794\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-sr<\/a> training dataset of Internet Serbian.<\/p>\n<p>You can also use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> in combination with the <a href=\"http:\/\/hdl.handle.net\/11356\/1789\" target=\"_blank\" rel=\"noopener\">CLARIN.SI embeddings<\/a> and the training dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1843\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a> to train and evaluate your own part-of-speech tagger. 
The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#part-of-speech-tagging-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q34\">Q3.4: What data is available for training a lemmatiser for Serbian?<\/h4>\n<p>Lemmatisers can be trained on the tagger training data (<a href=\"http:\/\/hdl.handle.net\/11356\/1843\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1794\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-sr<\/a>, see the section on PoS tagger training for details) and\/or on the inflectional lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1233\" target=\"_blank\" rel=\"noopener\">srLex<\/a>.<\/p>\n<p>For training your own lemmatiser for standard and non-standard Serbian, you can use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which uses an external lexicon for lemmatisation (<a href=\"http:\/\/hdl.handle.net\/11356\/1233\" target=\"_blank\" rel=\"noopener\">srLex<\/a>). The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#lemmatization\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q35\">Q3.5: What data is available for training a named entity recogniser for Serbian?<\/h4>\n<p>For training a named entity recogniser for standard language, <a href=\"http:\/\/hdl.handle.net\/11356\/1843\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a> is the best resource. For training NER systems for online, non-standard texts, <a href=\"http:\/\/hdl.handle.net\/11356\/1794\" target=\"_blank\" rel=\"noopener\">ReLDI-NormTagNER-sr<\/a> can be used.<\/p>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> allows you to train your own named entity recogniser as well. 
The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#ner-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q36\">Q3.6: What data is available for training a syntactic parser for Serbian?<\/h4>\n<p>If you want to follow the <a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a> for dependency parsing, the best location for obtaining training data is the <a href=\"https:\/\/github.com\/UniversalDependencies\/UD_Serbian-SET\" target=\"_blank\" rel=\"noopener\">Universal Dependencies repository<\/a>.<\/p>\n<p>If you require additional annotation layers, e.g., for multi-task learning, the <a href=\"http:\/\/hdl.handle.net\/11356\/1843\" target=\"_blank\" rel=\"noopener\">SETimes.SR<\/a> dataset should be used.<\/p>\n<p>You can also use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> to train your own parser. The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#parsing-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n<p>&nbsp;<\/p>\n<div id=\"themify_builder_content-3568\" data-postid=\"3568\" class=\"themify_builder_content themify_builder_content-3568 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please do let us know on helpdesk.classla@clarin.si, Subject &#8220;FAQ_Serbian&#8221;. The questions in this FAQ are organised into three main sections: 1. 
Online Serbian language resources Q1.1: Where can I find [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-3568","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3568","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=3568"}],"version-history":[{"count":71,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3568\/revisions"}],"predecessor-version":[{"id":7896,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3568\/revisions\/7896"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=3568"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}