Slovenska raziskovalna infrastruktura za jezikovne vire in tehnologije
Common Language Resources and Technology Infrastructure, Slovenia

Success stories

We measure our success along two dimensions: 1. the availability of language resources and technologies for South Slavic languages and, equally important, 2. the creation of these resources and technologies through a joint, synergistic effort. We consider the second dimension to be highly important for three reasons: 1. collaboration ensures the transfer of knowledge across the community, 2. collaboration also ensures the comparability of the resulting resources and technologies, which simplifies the future development and use of language technologies, and 3. funding for South Slavic languages is limited, so failing to exploit the large synergistic potential would keep this language group lagging behind the better-resourced Western European languages.

We do not yet have many success stories to report for CLASSLA itself, apart from the CLASSLA pipeline, the BERTić model, and the CLASSLA Wikipedia corpora, which only partially satisfy the second, synergy-related success criterion. For now, we therefore report on the earlier successful synergistic activities that directly motivated us to set up CLASSLA in the first place. All the presented success stories have one common denominator: the ReLDI project. We have therefore become used to referring to the underlying synergistic phenomenon as the ReLDI effect. We hope that we will be able to refer to this phenomenon as the CLASSLA effect in the near future.

Below, we report on the following success stories:

Training datasets for linguistic processing of non-standard Croatian and Serbian

In Slovenia, the national project JANES (2014-2018) had, inter alia, the goal of producing training data for developing language tools for processing non-standard Internet Slovenian. This goal was achieved with the Janes-Tag dataset, which includes 75 thousand tokens manually annotated for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition. The researchers from the ReLDI project, which focused on the joint development of Croatian and Serbian resources and technologies, collaborated closely with the researchers from the JANES project, and so the idea emerged to develop training datasets for Croatian and Serbian similar to those for Slovenian. In the end, they were produced at a fraction of the cost of the Slovenian training data, since the endeavour reused 1. the same Twitter data collection method, 2. the same manual annotation technology, and, finally, 3. the very complex annotation guidelines, which only had to be adapted to the two additional languages. The third point proved to be very important: the grammatical traditions of all three languages do not cover non-standard language phenomena, which required the annotation guidelines to be very precise on all levels. This synergy resulted in training data for non-standard Internet Croatian, the ReLDI-NormTagNER-hr dataset, and comparable training data for Serbian, the ReLDI-NormTagNER-sr dataset. These datasets enabled, inter alia, the current CLASSLA pipeline to offer a special processing mode for non-standard Internet language, not just for Slovenian, but also for Croatian and Serbian.

Training datasets for linguistic processing of standard Croatian and Serbian

Obtaining high-quality training data for standard Croatian was by no means a simple task. It started as an informal project of two researchers that resulted in the SETimes.HR dataset, which was further extended and improved through various internationally funded projects into the current hr500k dataset. While early experiments had already shown Croatian training data to be highly effective for training models for processing Serbian, the researchers from the ReLDI project were fully aware of the need for dedicated Serbian training data. This is how the SETimes.SR dataset came into existence, exploiting 1. the same data source as the Croatian training dataset, 2. mostly the same annotation guidelines and annotation technology, and 3. the models trained on the Croatian data for highly accurate pre-annotation of the Serbian dataset. Again, by exploiting these synergies, a new dataset, SETimes.SR, was produced at a fraction of the full cost of producing such a dataset. Today, similarly to the non-standard data presented in the previous success story, these two datasets enable basic linguistic processing of Croatian and Serbian through the CLASSLA pipeline, as well as through other pipelines based on the Universal Dependencies project.

Inflectional lexicons of Croatian and Serbian

During the development of the Croatian standard language training dataset, the hrLex inflectional lexicon was developed by exploiting machine learning and large corpora. Within the ReLDI project, the idea emerged to expand these activities to the Serbian language as well, especially given that, in terms of inflectional morphology, the differences between the two languages are rather limited. This is how yet another crucial resource for the Serbian language, the srLex inflectional lexicon, came into existence at a fraction of the full cost. It almost goes without saying at this point that these lexicons improve the lemmatisation performance of the CLASSLA pipeline for Croatian and Serbian.