{"id":3564,"date":"2019-03-20T13:35:29","date_gmt":"2019-03-20T13:35:29","guid":{"rendered":"http:\/\/www.clarin.si\/info\/?page_id=3564"},"modified":"2024-12-19T13:59:25","modified_gmt":"2024-12-19T13:59:25","slug":"faq4slovene","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/faq4slovene\/","title":{"rendered":"FAQ for Slovene language resources and technologies"},"content":{"rendered":"<p>This FAQ is part of the documentation of the <a href=\"..\/\">CLASSLA<\/a> CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please do let us know on <a href=\"mailto:helpdesk.classla@clarin.si?subject=FAQ_Slovene\">helpdesk.classla@clarin.si<\/a>, Subject &#8220;FAQ_Slovene&#8221;.<\/p>\n<p>The questions in this FAQ are organised into three main sections:<\/p>\n\n<h2 id=\"existing\">1. Online Slovene language resources<\/h2>\n<h4 id=\"q11\">Q1.1: Where can I find Slovene dictionaries?<\/h4>\n<p>Below we list the main dictionary portals offered by CLARIN.SI partners or supported by CLARIN.SI:<\/p>\n<div id=\"tab-tab-2860-0-0-6-2860-0\" class=\"tab-content\">\n<ul>\n<li><a href=\"https:\/\/fran.si\/\" target=\"_blank\" rel=\"noopener\">FRAN<\/a> offers aggregate search over all Slovene dictionaries (general, etymological, historical, terminological, and dialectal) of the <a href=\"http:\/\/www.clarin.si\/info\/partners\/#zrcsazu\">Fran Ramov\u0161 Institute of the Slovenian Language at ZRC<\/a>. 
The Institute also offers a School Dictionary of Slovenian Language on the <a href=\"https:\/\/www.xn--franek-l2a.si\/\" target=\"_blank\" rel=\"noopener\">Fran\u010dek<\/a> portal.<\/li>\n<li><a href=\"https:\/\/viri.cjvt.si\/sopomenke\/eng\/\" target=\"_blank\" rel=\"noopener\">Thesaurus<\/a>, <a href=\"https:\/\/viri.cjvt.si\/kolokacije\/eng\/\" target=\"_blank\" rel=\"noopener\">Collocation dictionary<\/a>, a small glossary of <a href=\"http:\/\/lexonomy.cjvt.si\/slovar-tviterscine\/\" target=\"_blank\" rel=\"noopener\">Twitterese<\/a>, and a <a href=\"https:\/\/viri.cjvt.si\/slovensko-madzarski\/eng\/\" target=\"_blank\" rel=\"noopener\">Slovene-Hungarian dictionary<\/a> are available at the <a href=\"http:\/\/www.clarin.si\/info\/partners\/#ul\">Ljubljana University CJVT infrastructure centre<\/a>.<\/li>\n<li><a href=\"https:\/\/www.kontekst.io\/\" target=\"_blank\" rel=\"noopener\">Kontekst.io<\/a> is a lexicon of semantically related words for Slovene, Croatian and Serbian, automatically produced on the basis of word-embeddings from large corpora.<\/li>\n<li><a title=\"http:\/\/www.termania.net\" href=\"https:\/\/www.termania.net\/\" target=\"_blank\" rel=\"noopener\">Termania<\/a> is a portal of free online dictionaries of various languages and fields, offered by the <a href=\"http:\/\/www.clarin.si\/info\/partners\/#amebis\">Amebis company<\/a>.<\/li>\n<li><a title=\"http:\/\/www.slovenscina.eu\/sloleks\" href=\"http:\/\/eng.slovenscina.eu\/sloleks\" target=\"_blank\" rel=\"noopener\">Sloleks<\/a>,<span class=\"style\"> a Slovene morphological lexicon, <\/span><span class=\"style\">a test Slovene on-line dictionary <\/span><a title=\"http:\/\/www.slovenscina.eu\/spletni-slovar\" href=\"http:\/\/eng.slovenscina.eu\/spletni-slovar\" target=\"_blank\" rel=\"noopener\">SSSJ<\/a><span class=\"style\">, and <\/span><span class=\"style\">a prototype Slovene lexical database <\/span><a title=\"http:\/\/www.slovenscina.eu\/spletni-slovar\/leksikalna-baza\" 
href=\"http:\/\/eng.slovenscina.eu\/spletni-slovar\/leksikalna-baza\" target=\"_blank\" rel=\"noopener\">LBS<\/a><span class=\"style\">\u00a0are among the results of the &#8220;<a href=\"http:\/\/projekt.slovenscina.eu\/\" target=\"_blank\" rel=\"noopener\">Communication in Slovene<\/a>&#8221; project, with the portal hosted at CLARIN.SI.<\/span><\/li>\n<li><a class=\"style_3\" title=\"http:\/\/nl.ijs.si\/slownet\" href=\"http:\/\/nl.ijs.si\/slownet\" target=\"_blank\" rel=\"noopener\">sloWNet<\/a>, the WordNet-based Slovenian semantic lexicon, <a title=\"http:\/\/nl.ijs.si\/imp\/#lexicon\" href=\"http:\/\/nl.ijs.si\/imp\/#lexicon\" target=\"_blank\" rel=\"noopener\">IMP<\/a>,<span class=\"style\"> a glossary of archaic Slovene, and <a href=\"http:\/\/nl.ijs.si\/jaslo\/index-en.html\" target=\"_blank\" rel=\"noopener\">jaSlo<\/a>, a Japanese-Slovene learners&#8217; dictionary, are offered by the <a href=\"http:\/\/www.clarin.si\/info\/partners\/#ijs\">Jo\u017eef Stefan Institute<\/a>.<br \/>\n<\/span><\/li>\n<li><a href=\"http:\/\/hdl.handle.net\/11356\/1888\" target=\"_blank\" rel=\"noopener\">WordNet OSWN<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1925\" target=\"_blank\" rel=\"noopener\">sloWNet-USAS<\/a>, the extended Slovenian semantic lexicons, which include the sloWNet lexicon, are available on <a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/?locale-attribute=en\" target=\"_blank\" rel=\"noopener\">the CLARIN.SI repository<\/a>.<\/li>\n<li><a class=\"style_3\" title=\"http:\/\/www.razvezanijezik.org\/\" href=\"http:\/\/www.razvezanijezik.org\/\" target=\"_blank\" rel=\"noopener\">Razvezani jezik<\/a>, a user-generated dictionary of spoken Slovene, is offered by the Domestic Research Society.<\/li>\n<li>Numerous terminological dictionaries, such as the <a href=\"http:\/\/hdl.handle.net\/11356\/1731\" target=\"_blank\" rel=\"noopener\">Slovenian-English glossary of education<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1727\" 
target=\"_blank\" rel=\"noopener\">Terminological dictionary of artificial intelligence<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1721\" target=\"_blank\" rel=\"noopener\">Terminological dictionary of tax terminology<\/a> and others, can be downloaded from <a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/?locale-attribute=en\">the repository of CLARIN.SI<\/a>.<\/li>\n<\/ul>\n<p>Dictionaries by other providers:<\/p>\n<ul>\n<li><a class=\"style_3\" title=\"http:\/\/www.evroterm.gov.si\/\" href=\"http:\/\/www.evroterm.gov.si\/\" target=\"_blank\" rel=\"noopener\">Evroterm<\/a>, multilingual terminology database, and a <a title=\"http:\/\/evroterm.gov.si\/slovar\/\" href=\"https:\/\/evroterm.vlada.si\/slovarji\" target=\"_blank\" rel=\"noopener\">list of on-line dictionaries<\/a>, by the Government of the Republic of Slovenia;<\/li>\n<li><a class=\"style_3\" title=\"http:\/\/www.islovar.org\/\" href=\"http:\/\/www.islovar.org\/\" target=\"_blank\" rel=\"noopener\">Islovar<\/a>, a terminological dictionary for the field of Informatics by the Slovene Society for Informatics;<\/li>\n<li><a class=\"style_3\" title=\"http:\/\/sl.wiktionary.org\/\" href=\"http:\/\/sl.wiktionary.org\/\" target=\"_blank\" rel=\"noopener\">Wikislovar<\/a>, the Slovene Wiktionary, i.e., the multilingual, openly accessible and openly editable dictionary.<\/li>\n<\/ul>\n<\/div>\n<h4 id=\"q12\">Q1.2: How can I\u00a0analyse Slovene corpora online?<\/h4>\n<p>CLARIN.SI offers access to four concordancers, which share the same set of corpora and back-end, but have different front-ends:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Crystal noSketch Engine<\/a>, an open-source variant of the well-known <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a>.\u00a0 Instructions for its use are available\u00a0<a href=\"https:\/\/www.sketchengine.co.uk\/user-guide\/\">here<\/a>. 
CLARIN.SI offers two installations of Crystal noSketch Engine: an <a href=\"https:\/\/www.clarin.si\/ske\/\" target=\"_blank\" rel=\"noopener\">open installation<\/a> (no log-in, which simplifies use for less advanced users) and a <a href=\"https:\/\/www.clarin.si\/skelog\" target=\"_blank\" rel=\"noopener\">version with log-in<\/a> which allows subcorpus creation and personalised display of e.g. corpus attributes.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>, with a somewhat different user interface. Basic functionality is provided without logging in, but to use more advanced functionalities, it is necessary to log in via your home institution.<\/li>\n<li><a href=\"https:\/\/www.clarin.si\/noske\/\" target=\"_blank\" rel=\"noopener noreferrer\">CLARIN.SI Bonito noSketch Engine<\/a> is the old version of noSketch Engine with a radically different user interface from Crystal. This version offers some functions that the new noSketch Engine does not, in particular, accessing the results of queries in XML, where it is enough to add the parameter \u201cformat=XML\u201d to the end of the query URL.<\/li>\n<\/ul>\n<p>Documentation on how to query corpora via the SketchEngine-like interfaces is available <a href=\"https:\/\/www.sketchengine.eu\/documentation\/corpus-querying\/\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>Note that the commercial <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a> also offers access to several Slovene language corpora, as well as\u00a0some additional tools that are not accessible on the free NoSketch Engine, including the tools to analyse collocations (<a href=\"https:\/\/www.sketchengine.eu\/guide\/word-sketch-collocations-and-word-combinations\/\" target=\"_blank\" rel=\"noopener\">Word sketches<\/a>), synonyms and antonyms (<a href=\"https:\/\/www.sketchengine.eu\/guide\/thesaurus-synonyms-antonyms-similar-words\/\" 
target=\"_blank\" rel=\"noopener\">Thesaurus<\/a>), the tools to compute frequency lists of multiword expressions (<a href=\"https:\/\/www.sketchengine.eu\/guide\/n-grams-multiword-expressions\/\" target=\"_blank\" rel=\"noopener\">N-grams<\/a>) and to extract <a href=\"https:\/\/www.sketchengine.eu\/guide\/keywords-and-term-extraction\/\" target=\"_blank\" rel=\"noopener\">keywords and terms<\/a>. It also allows users to create their own corpora.<\/p>\n<p>Some Slovene corpora, especially those produced within the &#8220;<a href=\"http:\/\/projekt.slovenscina.eu\/\" target=\"_blank\" rel=\"noopener\">Communication in Slovene<\/a>&#8221; project, have their own specialised web concordancers; see the corpora listed in <a href=\"#q13\">Q1.3<\/a>.<\/p>\n<h4 id=\"q13\">Q1.3: Which Slovene corpora can I analyse\u00a0online?<\/h4>\n<p>The main reference corpus for Slovene is <a href=\"https:\/\/viri.cjvt.si\/gigafida\/System\/About\" target=\"_blank\" rel=\"noopener\">Gigafida<\/a> (1 billion words), which you can query <a href=\"https:\/\/viri.cjvt.si\/gigafida\/\" target=\"_blank\" rel=\"noopener\">via its specialized interface<\/a>, via <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=gfida20\" target=\"_blank\" rel=\"noopener\">Crystal noSkE<\/a>, <a href=\"https:\/\/www.clarin.si\/noske\/run.cgi\/corp_info?corpname=gfida20&amp;struct_attr_stats=1\" target=\"_blank\" rel=\"noopener\">Bonito noSkE<\/a> or <a href=\"https:\/\/www.clarin.si\/kontext\/first_form?corpname=gfida20\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. Note that the corpus is also available in a version which has (near) duplicate paragraphs removed, cf. <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=gfida20_dedup\" target=\"_blank\" rel=\"noopener\">noSkE<\/a> or <a href=\"https:\/\/www.clarin.si\/kontext\/first_form?corpname=gfida20_dedup\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. 
A balanced subset of Gigafida is <a href=\"http:\/\/eng.slovenscina.eu\/korpusi\/kres\" target=\"_blank\" rel=\"noopener\">KRES<\/a> (100 million tokens), which you can query <a href=\"http:\/\/www.korpus-kres.net\/\" target=\"_blank\" rel=\"noopener\">via its specialized interface.<\/a><\/p>\n<p>For a complete list of corpora available under CLARIN.SI concordancers, see the index for <a href=\"https:\/\/www.clarin.si\/ske\/#open\" target=\"_blank\" rel=\"noopener\">Crystal noSkE<\/a>, <a href=\"https:\/\/www.clarin.si\/noske\/index.html\" target=\"_blank\" rel=\"noopener\">Bonito noSkE<\/a> or <a href=\"https:\/\/www.clarin.si\/kontext\/\" target=\"_blank\" rel=\"noopener\">KonText<\/a>. Below we list some of the important ones, with links to the Crystal noSketch Engine concordancer:<\/p>\n<ul>\n<li><em>general language corpora<\/em>\u00a0(apart from <a href=\"https:\/\/viri.cjvt.si\/gigafida\/\" target=\"_blank\" rel=\"noopener\">Gigafida<\/a>) are <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlaweb_sl\" target=\"_blank\" rel=\"noopener\">CLASSLA-web.sl<\/a> and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=slwac\" target=\"_blank\" rel=\"noopener\">slWaC<\/a>, large corpora (2 billion and 900 million tokens respectively) of Slovene texts from the Web<\/li>\n<li><em>specialized corpora<\/em> include the corpus of academic writing <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=kas\" target=\"_blank\" rel=\"noopener\">KAS<\/a>, the corpus of scientific publications from the Open Science portal <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=oss10\" target=\"_blank\" rel=\"noopener\">OSS<\/a>, the corpus of scientific texts of contemporary Slovenian <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=kzb10\" target=\"_blank\" rel=\"noopener\">KZB<\/a>, the corpus of user-generated content (blogs, forums, comments, and tweets) <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=janes\" 
target=\"_blank\" rel=\"noopener\">Janes<\/a>, the monitor news corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=trendi\" target=\"_blank\" rel=\"noopener\">Trendi<\/a>, the spoken corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=gos21\" target=\"_blank\" rel=\"noopener\">GOS<\/a>, the parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=siparl40\" target=\"_blank\" rel=\"noopener\">siParl<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_si\" target=\"_blank\" rel=\"noopener\">ParlaMint-SI<\/a>, <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=yu1parl\" target=\"_blank\" rel=\"noopener\">yu1Parl<\/a> and the Carniolan Provincial Assembly corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=kranjska\" target=\"_blank\" rel=\"noopener\">Kranjska<\/a>, the Wikipedia corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=classlawiki_sl\" target=\"_blank\" rel=\"noopener\">CLASSLAWiki-sl<\/a>, the corpus of historical Slovene <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=imp\" target=\"_blank\" rel=\"noopener\">IMP<\/a>, the corpus of Slovenian periodicals (1771-1914) <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=speriodika\" target=\"_blank\" rel=\"noopener\">sPeriodika<\/a>, the <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=pregovori\" target=\"_blank\" rel=\"noopener\">Proverbs<\/a> corpus, the corpus of longer narrative prose <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=kdsp\" target=\"_blank\" rel=\"noopener\">KDSP<\/a>, the corpus of youth literature <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=maks\" target=\"_blank\" rel=\"noopener\">MAKS<\/a>, the developmental corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=solar30_orig\" target=\"_blank\" rel=\"noopener\">\u0160OLAR<\/a>, and the corpus of Slovene as a foreign language <a 
href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=kost20_orig\" target=\"_blank\" rel=\"noopener\">KOST<\/a><\/li>\n<li><em>manually annotated corpora<\/em> include the reference training corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=suk11\" target=\"_blank\" rel=\"noopener\">SUK<\/a>, the corpus of historical Slovene <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=goo300k\" target=\"_blank\" rel=\"noopener\">goo300k<\/a> (sampled from the IMP corpus), the corpus of term-annotated texts <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=rsdo5\" target=\"_blank\" rel=\"noopener\">RSDO5<\/a>, and the corpora of user-generated content <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=janes_norm30\" target=\"_blank\" rel=\"noopener\">Janes Norm<\/a> (sampled from the Janes corpus), which is manually annotated with normalised word-forms, and <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=janes_tag\" target=\"_blank\" rel=\"noopener\">Janes Tag<\/a>\u00a0(sampled from Janes-norm), also manually annotated with morphosyntactic descriptions, lemmas, and named entities<\/li>\n<li><em>a meta corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=mfida10\" target=\"_blank\" rel=\"noopener\">metaFida<\/a><\/em>, which contains 6 billion tokens, unites the most important publicly accessible Slovene corpora and enables a uniform search through them<\/li>\n<li><em>parallel corpora<\/em> include the multilingual European parliamentary corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX<\/a>, paired with the machine-translated English corpora <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=parlamint41_xx_en\" target=\"_blank\" rel=\"noopener\">ParlaMint-XX-en<\/a>, multilingual DGT translation memory corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=dgtud_sl\" target=\"_blank\" 
rel=\"noopener\">EU DGT-UD: Slovenian<\/a>, the Slovene-English corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=trans5_sl\" target=\"_blank\" rel=\"noopener\">TRANS5<\/a>, the Italian-Slovene corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=ispac_sl\" target=\"_blank\" rel=\"noopener\">ISPAC<\/a>, the French-Slovene corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=lemonde_sl\" target=\"_blank\" rel=\"noopener\">LeMonde<\/a>, and the Japanese-Slovene corpus <a href=\"https:\/\/www.clarin.si\/ske\/#dashboard?corpname=jaslo_sl\" target=\"_blank\" rel=\"noopener\">jaSlo<\/a><\/li>\n<\/ul>\n<p>Furthermore, the commercial <a href=\"https:\/\/www.sketchengine.eu\" target=\"_blank\" rel=\"noopener\">Sketch Engine<\/a> includes the <a href=\"https:\/\/www.sketchengine.eu\/corpora-and-languages\/slovenian-text-corpora\/\" target=\"_blank\" rel=\"noopener\">following Slovene corpora<\/a>: learner corpus of proofread and translations Lektor, which you can also query <a href=\"http:\/\/lektor.sketchengine.co.uk\/run.cgi\/first_form?corpname=fidaplus_lektor\" target=\"_blank\" rel=\"noopener\">via its specialized interface,<\/a>\u00a0<a href=\"https:\/\/www.sketchengine.eu\/eurlex-corpus\/\" target=\"_blank\" rel=\"noopener\">EUR-Lex Slovenian<\/a> 2\/2016, parallel corpus <a href=\"https:\/\/www.sketchengine.eu\/europarl-parallel-corpus\/\" target=\"_blank\" rel=\"noopener\">EUROPARL7<\/a>, created from the European Parliament Proceedings, and <a href=\"https:\/\/www.sketchengine.eu\/opus-parallel-corpora\/\" target=\"_blank\" rel=\"noopener\">OPUS2<\/a>, a parallel corpus of 40 languages.<\/p>\n<h4>Q1.4: What linguistic annotation schemas are used in Slovene corpora?<\/h4>\n<p>For a detailed overview of annotation schemas, see the\u00a0<a href=\"https:\/\/wiki.cjvt.si\/shelves\/linguistic-annotation-of-slovene-corpora\" target=\"_blank\" rel=\"noopener\">information on linguistic annotation of Slovene corpora<\/a> on 
the\u00a0<a href=\"https:\/\/wiki.cjvt.si\/\" target=\"_blank\" rel=\"noopener\">CJVT Wiki<\/a>. The overview covers the following corpus annotation levels: <a href=\"https:\/\/wiki.cjvt.si\/books\/01-tokenization\" target=\"_blank\" rel=\"noopener\">tokenisation<\/a>, <a href=\"https:\/\/wiki.cjvt.si\/books\/02-segmentation\" target=\"_blank\" rel=\"noopener\">sentence segmentation<\/a>, <a href=\"https:\/\/wiki.cjvt.si\/books\/05-lemmatization\" target=\"_blank\" rel=\"noopener\">lemmatisation<\/a>, and <a href=\"https:\/\/wiki.cjvt.si\/books\/06-jos-syn-syntax\" target=\"_blank\" rel=\"noopener\">JOS<\/a>\/<a href=\"https:\/\/wiki.cjvt.si\/books\/04-multext-east-morphosyntax\" target=\"_blank\" rel=\"noopener\">MULTEXT-East<\/a> morphosyntactic descriptions, <a href=\"https:\/\/wiki.cjvt.si\/books\/06-jos-syn-syntax\" target=\"_blank\" rel=\"noopener\">JOS syntax<\/a>, <a href=\"https:\/\/wiki.cjvt.si\/books\/07-universal-dependencies-FPQ\" target=\"_blank\" rel=\"noopener\">Universal Dependencies (UD) syntax<\/a>, <a href=\"https:\/\/wiki.cjvt.si\/books\/10-semantic-role-labeling\" target=\"_blank\" rel=\"noopener\">semantic role labelling<\/a> (SRL), <a href=\"https:\/\/wiki.cjvt.si\/books\/08-named-entities\" target=\"_blank\" rel=\"noopener\">named entities<\/a> (NER), <a href=\"https:\/\/wiki.cjvt.si\/books\/03-normalization\" target=\"_blank\" rel=\"noopener\">normalization<\/a>,\u00a0 <a href=\"https:\/\/wiki.cjvt.si\/books\/09-coreferences\" target=\"_blank\" rel=\"noopener\">coreferences<\/a>, and <a href=\"https:\/\/wiki.cjvt.si\/books\/13-relations\" target=\"_blank\" rel=\"noopener\">relations<\/a>. 
It also includes specialised systems for annotating language corrections in the <a href=\"https:\/\/wiki.cjvt.si\/books\/11-developmental-corpus-solar\" target=\"_blank\" rel=\"noopener\">\u0160olar (Slovene student texts)<\/a> and <a href=\"https:\/\/wiki.cjvt.si\/books\/12-slovene-learner-corpus-kost\" target=\"_blank\" rel=\"noopener\">KOST (texts by speakers of Slovene as a foreign language)<\/a> corpora. The section on each annotation level contains an introduction, an explanation of tags or processes, annotation guidelines, and relevant references and links.<\/p>\n<p>Most Slovene corpora are annotated according to the <a href=\"https:\/\/wiki.cjvt.si\/books\/04-multext-east-morphosyntax\" target=\"_blank\" rel=\"noopener\">MULTEXT-East morphosyntactic specifications for Slovene<\/a>. At the level of syntax, especially for older corpora, the <a href=\"https:\/\/wiki.cjvt.si\/books\/06-jos-syn-syntax\" target=\"_blank\" rel=\"noopener\">Slovene-specific SSJ annotation scheme<\/a> is used. Corpora are also annotated according to the <a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies guidelines<\/a>. 
Named entities are often annotated following the <a href=\"https:\/\/nl.ijs.si\/janes\/wp-content\/uploads\/2017\/09\/SlovenianNER-eng-v1.1.pdf\" target=\"_blank\" rel=\"noopener\">Janes NE guidelines for Slovene<\/a>.<\/p>\n<h4 id=\"q15\">Q1.5: Where can I download Slovene resources?<\/h4>\n<p>The main point for archiving and downloading Slovene language resources is <a href=\"https:\/\/www.clarin.si\/repository\/xmlui\/?locale-attribute=en\">the repository of CLARIN.SI<\/a>.<\/p>\n<p>In addition to the resources mentioned above and below, the repository offers:<\/p>\n<ul>\n<li><em>manually annotated corpora and datasets<\/em>, including the\u00a0corpus of comma placement <a href=\"http:\/\/hdl.handle.net\/11356\/1185\" target=\"_blank\" rel=\"noopener\">Vejica 1.3<\/a>, the dataset of idiomatic expressions <a href=\"http:\/\/hdl.handle.net\/11356\/1335\" target=\"_blank\" rel=\"noopener\">SloIE<\/a>, the corpora of metaphorical expressions <a href=\"http:\/\/hdl.handle.net\/11356\/1293\" target=\"_blank\" rel=\"noopener\">KOMET<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1490\" target=\"_blank\" rel=\"noopener\">G-KOMET<\/a>, the bilingual terminology extraction dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1199\" target=\"_blank\" rel=\"noopener\">KAS-biterm<\/a>, the sentiment annotated news corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1110\" target=\"_blank\" rel=\"noopener\">SentiNews<\/a>, the multilingual sentiment dataset of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1868\" target=\"_blank\" rel=\"noopener\">ParlaSent<\/a>, the English-Slovene text genre dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1960\" target=\"_blank\" rel=\"noopener\">X-GENRE<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1651\" target=\"_blank\" rel=\"noopener\">Semantic change detection datasets for Slovenian<\/a>, the Tweet code-switching corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1154\" target=\"_blank\" 
rel=\"noopener\">Janes-Preklop<\/a>, the offensive language dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1462\" target=\"_blank\" rel=\"noopener\">FRENK<\/a>, annotated for different types of socially unacceptable discourse, the post-edited and error annotated machine translation corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1065\" target=\"_blank\" rel=\"noopener\">PErr<\/a>, the commonsense reasoning dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1766\" target=\"_blank\" rel=\"noopener\">DIALECT-COPA<\/a> in the Cerkno dialect, and the dataset for evaluation of Slovene spell- and grammar-checking tools <a href=\"http:\/\/hdl.handle.net\/11356\/1902\" target=\"_blank\" rel=\"noopener\">\u0160olar-Eval<\/a>;<\/li>\n<li><em>other parallel corpora, <\/em>including the Slovene-English parallel corpora <a href=\"http:\/\/hdl.handle.net\/11356\/1813\" target=\"_blank\" rel=\"noopener\">MaCoCu-sl-en<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1061\" target=\"_blank\" rel=\"noopener\">slenWaC<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1457\" target=\"_blank\" rel=\"noopener\">RSDO4 1.0<\/a>, and the Slovene-English parallel corpus of idiomatic text <a href=\"http:\/\/hdl.handle.net\/11356\/1714\" target=\"_blank\" rel=\"noopener\">ParaDiom<\/a>;<\/li>\n<li><em>other corpora and datasets,<\/em> including a large web corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1795\" target=\"_blank\" rel=\"noopener\">MaCoCu-sl<\/a> with 1.9 billion words, also available as a genre-enriched version inside the <a href=\"http:\/\/hdl.handle.net\/11356\/1969\" target=\"_blank\" rel=\"noopener\">MaCoCu-Genre<\/a> corpus collection, the linguistically annotated corpus of parliamentary debates <a href=\"http:\/\/hdl.handle.net\/11356\/1911\" target=\"_blank\" rel=\"noopener\">ParlaMint.ana<\/a>, the sentiment annotated <a href=\"http:\/\/hdl.handle.net\/11356\/1054\" target=\"_blank\" rel=\"noopener\">Twitter corpora<\/a>, the <a 
href=\"http:\/\/hdl.handle.net\/11356\/1423\" target=\"_blank\" rel=\"noopener\">Twitter dataset<\/a> with automatically assigned hate speech labels, the corpus of jokes <a href=\"http:\/\/hdl.handle.net\/11356\/1945\" target=\"_blank\" rel=\"noopener\">\u0160ale24<\/a>, the IPTC news media topic dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1991\" target=\"_blank\" rel=\"noopener\">EMMediaTopic<\/a>,\u00a0 the corpus of textbooks <a href=\"http:\/\/hdl.handle.net\/11356\/1693\" target=\"_blank\" rel=\"noopener\">ccU\u010dbeniki<\/a>,\u00a0the corpus of 1968 literature <a href=\"http:\/\/hdl.handle.net\/11356\/1491\" target=\"_blank\" rel=\"noopener\">Maj68<\/a>, the dataset of medical texts <a href=\"http:\/\/hdl.handle.net\/11356\/1983\" target=\"_blank\" rel=\"noopener\">PoVeJMo-VeMo-Med<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1694\" target=\"_blank\" rel=\"noopener\">Slovenian datasets for contextual synonym and antonym detection<\/a>, the parallel sense-annotated corpus <a href=\"http:\/\/hdl.handle.net\/11356\/1842\" target=\"_blank\" rel=\"noopener\">ELEXIS-WSD<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1988\" target=\"_blank\" rel=\"noopener\">KE-WSC Winograd Schema Challenge<\/a> dataset for knowledge-enhanced machine learning and model evaluation, the text simplification dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1682\" target=\"_blank\" rel=\"noopener\">SloTS<\/a>, the natural language inference dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1707\" target=\"_blank\" rel=\"noopener\">SI-NLI<\/a>, the corpus for general relation extraction <a href=\"http:\/\/hdl.handle.net\/11356\/1730\" target=\"_blank\" rel=\"noopener\">SloREL<\/a>, the instruction-following datasets for large language models <a href=\"http:\/\/hdl.handle.net\/11356\/1971\" target=\"_blank\" rel=\"noopener\">GaMS-Instruct-GEN<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1975\" target=\"_blank\" rel=\"noopener\">GaMS-Instruct-DH<\/a> and <a 
href=\"http:\/\/hdl.handle.net\/11356\/1982\" target=\"_blank\" rel=\"noopener\">GaMS-Instruct-MED<\/a>, the automatic speech recognition database <a href=\"http:\/\/hdl.handle.net\/11356\/1772\" target=\"_blank\" rel=\"noopener\">ARTUR<\/a>, and the <a href=\"http:\/\/hdl.handle.net\/11356\/1977\" target=\"_blank\" rel=\"noopener\">spoken corpus Berta<\/a>;<\/li>\n<li><em>wordlists and other lexical resources,\u00a0<\/em>including the <a href=\"http:\/\/hdl.handle.net\/11356\/1780\" target=\"_blank\" rel=\"noopener\">terminological multiword expressions lexicon<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1846\" target=\"_blank\" rel=\"noopener\">WeSoSlaV Western South Slavic verbal database<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1697\" target=\"_blank\" rel=\"noopener\">core vocabulary for Slovenian as L2<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1980\" target=\"_blank\" rel=\"noopener\">SWOW-SL word association dataset<\/a>, a <a href=\"http:\/\/hdl.handle.net\/11356\/1048\" target=\"_blank\" rel=\"noopener\">lexicon of emoji characters<\/a> with automatically assigned sentiment, and the lexicon of emotion, valence, arousal and dominance <a href=\"http:\/\/hdl.handle.net\/11356\/1875\" target=\"_blank\" rel=\"noopener\">SloEmoLex<\/a>.<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"processing\">2. Tools to annotate Slovene texts<\/h2>\n<h4 id=\"q21\">Q2.1: How can I perform basic linguistic processing of my Slovene texts?<\/h4>\n<p>The state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> provides processing of standard and non-standard (Internet) Slovene on the levels of tokenization and sentence splitting, part-of-speech tagging, lemmatisation, dependency parsing and named entity recognition. 
The CLASSLA-Stanza pipeline uses two tokenizers: the rule-based tokenizer <a href=\"https:\/\/github.com\/clarinsi\/Obeliks4J\" target=\"_blank\" rel=\"noopener\">Obeliks4J<\/a> for the standard Slovene pipeline and <a href=\"https:\/\/github.com\/clarinsi\/reldi-tokeniser\" target=\"_blank\" rel=\"noopener\">reldi-tokeniser<\/a> for other cases. Off-the-shelf models are also available for lemmatisation of <a href=\"http:\/\/hdl.handle.net\/11356\/1768\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1784\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Slovene, part-of-speech tagging of <a href=\"http:\/\/hdl.handle.net\/11356\/1767\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1786\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Slovene, and semantic role labeling of <a href=\"http:\/\/hdl.handle.net\/11356\/1770\" target=\"_blank\" rel=\"noopener\">standard<\/a> Slovene. You can try out the pipeline at the <a href=\"https:\/\/clarin.si\/oznacevalnik\/eng\" target=\"_blank\" rel=\"noopener\">CLASSLA Annotation tool<\/a> website.<\/p>\n<p>The documentation for the installation and use of the pipeline is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md\" target=\"_blank\" rel=\"noopener\">here<\/a>. 
Furthermore, the CLASSLA-Stanza pipeline offers some additional features, namely the use of a <a href=\"http:\/\/eng.slovenscina.eu\/tehnologije\/razclenjevalnik\" target=\"_blank\" rel=\"noopener\">Slovene-specific dependency parsing system<\/a>, an inflectional lexicon and pretokenized data, which are documented <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.superuser.md\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<p>In addition to this, tokenisation, part-of-speech tagging, and lemmatisation are also provided by the CLARIN.SI service <a href=\"https:\/\/clarin.si\/services\/web\/\">ReLDIanno<\/a>. This is a legacy system for linguistic annotation that we still keep available for backward compatibility, but we recommend that new users use the above-mentioned CLASSLA-Stanza pipeline.<\/p>\n<h4 id=\"q22\">Q2.2: How can I standardize my texts prior to further processing?<\/h4>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, mentioned above, also includes models for processing non-standard text, which allows such texts to be annotated without prior standardization.<\/p>\n<p>Currently, the only on-line text normalisation tool available through the <a href=\"http:\/\/clarin.si\/services\/web\/\">CLARIN.SI services<\/a> (ReLDIanno) is the REDI diacritic restorer. Its usage is documented <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">here<\/a>. You can also download the <a href=\"https:\/\/github.com\/clarinsi\/redi\" target=\"_blank\" rel=\"noopener\">REDI\u00a0diacritic restorer<\/a>, install it and use it locally.<\/p>\n<p>For word-level normalisation of e.g. 
historical and user-generated Slovene texts, you can download and install the <a href=\"https:\/\/github.com\/clarinsi\/csmtiser\" target=\"_blank\" rel=\"noopener\">CSMTiser text normaliser<\/a>.<\/p>\n<h4 id=\"q23\">Q2.3: How can I annotate my texts for named entities?<\/h4>\n<p>Named entity recognition is provided by the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which also offers off-the-shelf models for <a href=\"http:\/\/hdl.handle.net\/11356\/1321\" target=\"_blank\" rel=\"noopener\">standard<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1339\" target=\"_blank\" rel=\"noopener\">non-standard<\/a> Slovene. In addition to this, on-line NER is available via the CLARIN.SI service <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">ReLDIanno<\/a>. You can also download the <a href=\"https:\/\/github.com\/clarinsi\/janes-ner\" target=\"_blank\" rel=\"noopener\">janes-ner<\/a> tool.<\/p>\n<h4 id=\"q24\">Q2.4: How can I syntactically parse my texts?<\/h4>\n<p>You can syntactically parse Slovene texts in multiple ways:<\/p>\n<ul>\n<li>by using the state-of-the-art <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> (<a href=\"https:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>), which also offers off-the-shelf models for <a href=\"http:\/\/hdl.handle.net\/11356\/1769\" target=\"_blank\" rel=\"noopener\">UD dependency parsing<\/a> and <a href=\"http:\/\/hdl.handle.net\/11356\/1764\" target=\"_blank\" rel=\"noopener\">JOS dependency parsing<\/a><\/li>\n<li>by using the CLARIN.SI service <a href=\"http:\/\/www.clarin.si\/info\/k-centre\/web-services-documentation\/\">ReLDIanno<\/a>\u00a0(<a href=\"http:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies 
formalism<\/a>)<\/li>\n<li>by using the <a href=\"http:\/\/ufal.mff.cuni.cz\/udpipe\" target=\"_blank\" rel=\"noopener\">UDPipe tool<\/a>, which has off-the-shelf models for many languages, Slovene included (<a href=\"http:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a>)<\/li>\n<li>by using the <a href=\"http:\/\/eng.slovenscina.eu\/tehnologije\/razclenjevalnik\" target=\"_blank\" rel=\"noopener\">Slovene Parser (Raz\u010dlenjevalnik)<\/a> tool (Slovene-specific formalism)<\/li>\n<\/ul>\n<hr class=\"shortcode hr blue\" style=\"width:100%;border-width:3px;\" \/>\n<h2 id=\"training\">3. Datasets to train Slovene annotation tools<\/h2>\n<h4 id=\"q31\">Q3.1: Where can I get word embeddings or pre-trained language models for Slovene?<\/h4>\n<ul>\n<li>The embeddings trained on the largest collection of Slovene textual data (Gigafida, slWaC, JANES, KAS, MaCoCu-sl, etc.) are provided in the <a href=\"http:\/\/hdl.handle.net\/11356\/1791\" target=\"_blank\" rel=\"noopener\">CLARIN.SI-embed.sl<\/a> embedding collection.<\/li>\n<li>There are also collections of trained embeddings for Slovene available from <a href=\"https:\/\/embeddings.sketchengine.eu\/static\/index.html\" target=\"_blank\" rel=\"noopener\">SketchEngine<\/a> and from <a href=\"https:\/\/fasttext.cc\/docs\/en\/crawl-vectors.html\" target=\"_blank\" rel=\"noopener\">fastText<\/a>.<\/li>\n<li>If you want to train your own embeddings, the largest freely available collection of Slovene texts is the <a href=\"http:\/\/hdl.handle.net\/11234\/1-1989\" target=\"_blank\" rel=\"noopener\">Slovene portion of Commoncrawl<\/a>.<\/li>\n<\/ul>\n<p>You can also use the Slovene BERT\/RoBERTa model <a href=\"http:\/\/hdl.handle.net\/11356\/1397\" target=\"_blank\" rel=\"noopener\">SloBERTa<\/a>, a state-of-the-art model representing words\/tokens as contextually dependent word embeddings. 
It allows you to extract word embeddings for every word occurrence, which can then be used in training a model for an end task. The scripts and programs used for data preparation and training the model are available <a href=\"https:\/\/github.com\/clarinsi\/Slovene-BERT-Tool\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q32\">Q3.2: What data is available for training a text normaliser for Slovene?<\/h4>\n<p>For training text normalisers for Internet Slovene, the <a href=\"http:\/\/hdl.handle.net\/11356\/1733\" target=\"_blank\" rel=\"noopener\">Janes-norm<\/a> dataset can be used. For normalising historical data, the <a href=\"http:\/\/hdl.handle.net\/11356\/1025\" target=\"_blank\" rel=\"noopener\">goo300k<\/a> dataset should be used.<\/p>\n<h4 id=\"q33\">Q3.3: What data is available for training a part-of-speech tagger for Slovene?<\/h4>\n<p>The reference dataset for training a standard tagger is <a href=\"http:\/\/hdl.handle.net\/11356\/1959\" target=\"_blank\" rel=\"noopener\">SUK<\/a>. There is also a silver-standard dataset available, <a href=\"http:\/\/hdl.handle.net\/11356\/1213\" target=\"_blank\" rel=\"noopener\">jos1M<\/a>. There are also training datasets available for Internet Slovene\u00a0(<a href=\"http:\/\/hdl.handle.net\/11356\/1732\" target=\"_blank\" rel=\"noopener\">Janes-tag<\/a>) and for historical Slovene\u00a0(<a href=\"http:\/\/hdl.handle.net\/11356\/1025\" target=\"_blank\" rel=\"noopener\">goo300k<\/a>).<\/p>\n<p>You can also use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> in combination with the <a href=\"http:\/\/hdl.handle.net\/11356\/1791\" target=\"_blank\" rel=\"noopener\">CLARIN.SI embeddings<\/a> and the training dataset <a href=\"http:\/\/hdl.handle.net\/11356\/1959\" target=\"_blank\" rel=\"noopener\">SUK<\/a> to train and evaluate your own part-of-speech tagger. 
The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#part-of-speech-tagging-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q34\">Q3.4: What data is available for training a lemmatiser for Slovene?<\/h4>\n<p>Lemmatisers can be trained on the tagger training data (<a href=\"http:\/\/hdl.handle.net\/11356\/1959\" target=\"_blank\" rel=\"noopener\">SUK<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1213\" target=\"_blank\" rel=\"noopener\">jos1M<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1238\" target=\"_blank\" rel=\"noopener\">Janes-tag<\/a>, <a href=\"http:\/\/hdl.handle.net\/11356\/1025\" target=\"_blank\" rel=\"noopener\">goo300k<\/a>; see the <a href=\"#q33\">section on PoS tagger training<\/a> for details), on the inflectional lexicon <a href=\"http:\/\/hdl.handle.net\/11356\/1745\" target=\"_blank\" rel=\"noopener\">Sloleks<\/a>, or on both.<\/p>\n<p>For training your own lemmatiser for standard and non-standard Slovene, you can use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a>, which uses an external lexicon for lemmatisation (<a href=\"http:\/\/hdl.handle.net\/11356\/1745\" target=\"_blank\" rel=\"noopener\">Sloleks<\/a>). The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#lemmatization\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q35\">Q3.5: What data is available for training a named entity recogniser for Slovene?<\/h4>\n<p>For training a named entity recogniser for standard language, <a href=\"http:\/\/hdl.handle.net\/11356\/1959\" target=\"_blank\" rel=\"noopener\">SUK<\/a>\u00a0is the best resource. For training NER systems for online, non-standard texts, <a href=\"http:\/\/hdl.handle.net\/11356\/1238\" target=\"_blank\" rel=\"noopener\">Janes-tag<\/a> can be used. 
Finally, for training historical NER models,\u00a0<a href=\"http:\/\/hdl.handle.net\/11356\/1025\" target=\"_blank\" rel=\"noopener\">goo300k<\/a>\u00a0is the best resource.<\/p>\n<p>The <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> allows you to train your own named entity recogniser as well. The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#ner-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4 id=\"q36\">Q3.6: What data is available for training a syntactic parser for Slovene?<\/h4>\n<p>If you want to follow the <a href=\"http:\/\/universaldependencies.org\/u\/overview\/syntax.html\" target=\"_blank\" rel=\"noopener\">Universal Dependencies formalism<\/a> for dependency parsing, the best location for obtaining training data is the <a href=\"https:\/\/github.com\/UniversalDependencies\/UD_Slovenian-SSJ\" target=\"_blank\" rel=\"noopener\">Universal Dependencies repository<\/a>.<\/p>\n<p>For training parsers by following the <a href=\"http:\/\/eng.slovenscina.eu\/tehnologije\/razclenjevalnik\" target=\"_blank\" rel=\"noopener\">Slovene-specific formalism<\/a>, the <a href=\"http:\/\/hdl.handle.net\/11356\/1959\" target=\"_blank\" rel=\"noopener\">SUK<\/a>\u00a0dataset should be used.<\/p>\n<p>You can also use the <a href=\"https:\/\/github.com\/clarinsi\/classla\" target=\"_blank\" rel=\"noopener\">CLASSLA-Stanza pipeline<\/a> to train your own parser. 
The documentation is available <a href=\"https:\/\/github.com\/clarinsi\/classla\/blob\/main\/README.train.md#parsing-1\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n\n<p>&nbsp;<\/p>\n<div id=\"themify_builder_content-4864\" class=\"themify_builder_content themify_builder_content-4864 themify_builder\" data-postid=\"4864\"><\/div>\n<div id=\"themify_builder_content-3564\" data-postid=\"3564\" class=\"themify_builder_content themify_builder_content-3564 themify_builder\">\n    <\/div>\n<!-- \/themify_builder_content -->\n","protected":false},"excerpt":{"rendered":"<p>This FAQ is part of the documentation of the CLASSLA CLARIN knowledge centre for South Slavic languages. If you notice any missing or wrong information, please do let us know on helpdesk.classla@clarin.si, Subject &#8220;FAQ_Slovene&#8221;. The questions in this FAQ are organised into three main sections: 1. Online Slovene language resources Q1.1: Where can I find [&hellip;]<\/p>\n","protected":false},"author":9,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-3564","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3564","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/9"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=3564"}],"version-history":[{"count":215,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3564\/revisions"}],"predecessor-version":[{"id":5900,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/
v2\/pages\/3564\/revisions\/5900"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=3564"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}