{"id":5356,"date":"2021-11-11T14:24:46","date_gmt":"2021-11-11T14:24:46","guid":{"rendered":"https:\/\/www.clarin.si\/info\/?page_id=5356"},"modified":"2021-11-12T09:22:58","modified_gmt":"2021-11-12T09:22:58","slug":"success-stories","status":"publish","type":"page","link":"https:\/\/www.clarin.si\/info\/k-centre\/success-stories\/","title":{"rendered":"Success stories"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">We measure our success along two dimensions: 1. availability of language resources and technologies for South Slavic languages and, equally important, 2. the creation of these resources and technologies through a joint, synergistic effort. We consider the second dimension to be vastly important for three reasons: 1. collaboration ensures transfer of knowledge across the community, 2. collaboration also ensures the comparability of the resulting resources and technologies, simplifying the future development and usage of language technologies, and 3. the funding for South Slavic languages is limited, and failing to exploit the large synergistic potential will keep this language group lagging behind the richer Western European languages.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While we do not have many success stories to report for CLASSLA yet (except for the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA pipeline<\/span><\/a><span style=\"font-weight: 400;\">, the <\/span><a href=\"https:\/\/huggingface.co\/classla\/bcms-bertic\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">BERTi\u0107 model<\/span><\/a>,<span style=\"font-weight: 400;\"> and the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1427\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA Wikipedia corpora<\/span><\/a><span style=\"font-weight: 400;\">, which only partially satisfy the second, synergy-related 
success criterion), for now we report the previous successful synergistic activities that directly motivated us to set up CLASSLA in the first place. All the presented success stories have one common denominator &#8211; the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI project<\/span><\/a><span style=\"font-weight: 400;\">, so we have become used to referring to the underlying synergistic phenomenon as the <\/span><i><span style=\"font-weight: 400;\">ReLDI effect<\/span><\/i><span style=\"font-weight: 400;\">. We hope that we will be able to refer to this phenomenon as the <\/span><i><span style=\"font-weight: 400;\">CLASSLA effect<\/span><\/i><span style=\"font-weight: 400;\"> in the near future.<\/span><\/p>\n<p>Below, we report on the following success stories:<\/p>\n\n<h2><b>Training datasets for linguistic processing of non-standard Croatian and Serbian<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">In Slovenia, the national project <\/span><a href=\"https:\/\/nl.ijs.si\/janes\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">JANES<\/span><\/a><span style=\"font-weight: 400;\"> (2014-2018) aimed, inter alia, to produce training data for developing language tools for processing non-standard, Internet Slovenian. This was successfully done in the form of the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1238\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Janes-Tag dataset<\/span><\/a><span style=\"font-weight: 400;\">, which includes 75 thousand tokens, manually annotated for <\/span><span style=\"font-weight: 400;\">tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition. 
As the researchers from the ReLDI project, which focused on joint development of Croatian and Serbian resources and technologies, collaborated closely with the researchers from the JANES project, the idea emerged to develop training datasets for Croatian and Serbian similar to those for Slovenian. In the end, they were produced at a fraction of the cost needed to produce the Slovenian training data, since the endeavour exploited 1. the same <\/span><a href=\"https:\/\/aclanthology.org\/L14-1642\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Twitter data collection method<\/span><\/a><span style=\"font-weight: 400;\">, 2. the same manual annotation technology, and, finally, 3. very complex annotation guidelines, which only needed to be adapted to the two additional languages. The third point proved to be very important, as the grammatical traditions of all three languages do not cover non-standard language phenomena, which required the annotation guidelines to be very precise at all levels. This synergy resulted in training data for non-standard Internet Croatian, the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1241\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI-NormTagNER-hr dataset<\/span><\/a><span style=\"font-weight: 400;\">, and the comparable training data for Serbian, the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1240\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI-NormTagNER-sr dataset<\/span><\/a><span style=\"font-weight: 400;\">. 
These datasets enabled, inter alia, the current <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA pipeline<\/span><\/a><span style=\"font-weight: 400;\"> to have a special processing mode for non-standard Internet language, not just for Slovenian, but also for Croatian and Serbian.<\/span><\/p>\n<h2><b>Training datasets for linguistic processing of standard Croatian and Serbian<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Obtaining high-quality training data for standard Croatian was by no means a simple task. It started as <\/span><a href=\"https:\/\/aclanthology.org\/L14-1542\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">an informal project of two researchers<\/span><\/a><span style=\"font-weight: 400;\"> resulting in the SETimes.HR dataset, which was further extended and improved through various internationally-funded projects to the current <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1183\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">hr500k dataset<\/span><\/a><span style=\"font-weight: 400;\">. While early experiments already proved Croatian training data <\/span><a href=\"https:\/\/aclanthology.org\/W15-5301\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">to be highly effective for training models for processing Serbian<\/span><\/a><span style=\"font-weight: 400;\">, the researchers from the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI project<\/span><\/a><span style=\"font-weight: 400;\"> were fully aware of the need for dedicated Serbian training data. This is how the <\/span><a href=\"https:\/\/vukbatanovic.github.io\/pdf\/JTDH_SR_2018.pdf\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">SETimes.SR dataset came into existence<\/span><\/a><span style=\"font-weight: 400;\">, exploiting 1. 
the same data source as the Croatian training dataset, 2. mostly the same annotation guidelines and annotation technology, and 3. the models trained over the Croatian data for highly accurate pre-annotation of the Serbian dataset. Again, thanks to this synergistic approach, a new dataset, <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1200\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">the SETimes.SR dataset<\/span><\/a><span style=\"font-weight: 400;\">, was produced at a fraction of the full cost of producing such a dataset. Today, similarly to the non-standard data presented in the previous success story, these two datasets enable basic linguistic processing of Croatian and Serbian through the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA pipeline<\/span><\/a><span style=\"font-weight: 400;\">, as well as through other pipelines based on the <\/span><a href=\"https:\/\/universaldependencies.org\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">Universal Dependencies project<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h2><b>Inflectional lexicons of Croatian and Serbian<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">During the development of the Croatian standard language training dataset, the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1232\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">hrLex inflectional lexicon<\/span><\/a><span style=\"font-weight: 400;\"> was built <\/span><a href=\"https:\/\/aclanthology.org\/R15-1050\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">by exploiting machine learning and large corpora<\/span><\/a><span style=\"font-weight: 400;\">. 
Inside the <\/span><a href=\"https:\/\/reldi.spur.uzh.ch\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">ReLDI project<\/span><\/a>,<span style=\"font-weight: 400;\"> the idea emerged to expand these activities to the Serbian language as well, especially given the fact that, on the side of inflectional morphology, the differences between the two languages are rather limited. This is how yet another crucial resource for the Serbian language came into existence for a fraction of the full cost, the <\/span><a href=\"http:\/\/hdl.handle.net\/11356\/1233\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">srLex inflectional lexicon<\/span><\/a><span style=\"font-weight: 400;\">. It is almost needless to say at this point that these lexicons improve the lemmatisation performance of the <\/span><a href=\"https:\/\/pypi.org\/project\/classla\/\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">CLASSLA pipeline<\/span><\/a><span style=\"font-weight: 400;\"> for Croatian and Serbian.<\/span><\/p>\n<!--themify_builder_content-->\n<div id=\"themify_builder_content-5356\" data-postid=\"5356\" class=\"themify_builder_content themify_builder_content-5356 themify_builder tf_clear\">\n    <\/div>\n<!--\/themify_builder_content-->\n","protected":false},"excerpt":{"rendered":"<p>We measure our success along two dimensions: 1. availability of language resources and technologies for South Slavic languages and, equally important, 2. the creation of these resources and technologies through a joint, synergistic effort. We consider the second dimension to be vastly important for three reasons: 1. 
collaboration ensures transfer of knowledge across the community, [&hellip;]<\/p>\n","protected":false},"author":13,"featured_media":0,"parent":3558,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-5356","page","type-page","status-publish","hentry","has-post-title","has-post-date","has-post-category","has-post-tag","has-post-comment","has-post-author",""],"_links":{"self":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5356","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/comments?post=5356"}],"version-history":[{"count":11,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5356\/revisions"}],"predecessor-version":[{"id":5367,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/5356\/revisions\/5367"}],"up":[{"embeddable":true,"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/pages\/3558"}],"wp:attachment":[{"href":"https:\/\/www.clarin.si\/info\/wp-json\/wp\/v2\/media?parent=5356"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}