Online concordancers – CLARIN Slovenia

	Common Language Resources and Technology Infrastructure, Slovenia

Concordancers are computer programs that enable searching and statistical analysis of data in large text collections (corpora).

Contents

CLARIN.SI Concordancers

noSketch Engine

KonText

Old noSketch Engine

Other concordancers and corpora

CLARIN.SI Concordancers

CLARIN.SI maintains several concordancers that enable searching over 100 corpora in 30 languages, totalling about 20 billion words. They support searching tagged corpora, displaying and manipulating concordances, creating frequency lists, calculating collocations, saving the results of queries, etc. They all use the same back-end program but differ in their user interfaces.
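To make these operations concrete, here is a minimal sketch of what a concordance, a frequency list, and a collocation count look like on a toy text. It is only an illustration: the function names (`tokenize`, `kwic`, `collocates`) and the sample sentence are hypothetical, and the code does not reflect how the CLARIN.SI back-end is implemented, which works on indexed, tagged corpora at a very different scale.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, width=3):
    """Return keyword-in-context rows: (left context, keyword, right context)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def collocates(tokens, keyword, window=2):
    """Count the words that occur within +/- window tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Hypothetical sample text standing in for a corpus.
sample = "The corpus is large. A corpus of spoken language differs from a written corpus."
tokens = tokenize(sample)

# Frequency list: how often each word form occurs.
frequency_list = Counter(tokens)

# Concordance: every hit of the keyword, aligned with its context.
for left, kw, right in kwic(tokens, "corpus"):
    print(f"{left:>25} | {kw} | {right}")
```

Real concordancers add much more on top of this, such as queries over lemmas and part-of-speech tags and statistical association scores for collocations, rather than the raw co-occurrence counts used here.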

noSketch Engine

noSketch Engine is an open-source version of the commercial Sketch Engine, which was developed by Lexical Computing. Instructions for its use are available here. Please note that the corpora offered by CLARIN.SI via noSketch Engine are not the same as those offered by Lexical Computing via Sketch Engine.

CLARIN.SI offers two installations of noSketch Engine:

https://www.clarin.si/ske: log-in is not required (or possible), which simplifies use for less advanced users.
https://www.clarin.si/skelog: log-in is required, although anyone can register; logging in allows subcorpus creation and a personalised display of, e.g., corpus attributes.

CLARIN.SI would like to thank the personnel of Lexical Computing, in particular Jan Bušta and Tomáš Svoboda, for their help with installing noSketch Engine at CLARIN.SI.

KonText

The KonText concordancer was developed for the purposes of the Czech National Corpus and is openly available on the GitHub platform. A user manual is available here. Unlike noSketch Engine, KonText offers immediate access to the speech recordings accompanying spoken corpora; however, it does not support the computation of keywords.

CLARIN.SI offers the following installation of KonText:

https://www.clarin.si/kontext: log-in is not required, although logging in (via AAI, e.g. EduGain) is needed to use the more advanced functions of KonText. Logging in also enables setting view options for individual corpora, saving personal subcorpora, keeping a history of queries, etc.

CLARIN.SI would like to thank the personnel of the Czech National Corpus, in particular Tomáš Machálek, for their help with installing KonText at CLARIN.SI.

Old noSketch Engine

The old noSketch Engine (“Bonito”) has a substantially different user interface from the current version. It is no longer maintained by Lexical Computing and has no user documentation. For the time being, the CLARIN.SI installation will remain available, because various language resources refer to it and because it offers some functions that the new noSketch Engine does not, in particular access to the results of queries in XML.

The entry point for the old noSketch Engine is:

https://www.clarin.si/noske: no log-in is possible.

CLARIN.SI would like to thank the directors of Lexical Computing, Miloš Jakubíček and Pavel Rychlý, for making their concordancer, and especially the manatee back-end, openly available.

Other concordancers and corpora

In addition to being searchable via the CLARIN.SI concordancers, some Slovenian reference corpora can also be searched through their dedicated concordancers, available at the Center for Language Resources and Technologies at the University of Ljubljana:

Gigafida is a reference corpus of written standard Slovene that includes texts of various genres. Its first version was developed during the Communication in Slovene project from 2007 to 2013, and its upgraded version (v2.0) was published in 2019.
Kres is a balanced subcorpus of the first version of the Gigafida corpus, created during the Communication in Slovene project.
Gos is a corpus of spoken Slovene, created during the Communication in Slovene project.

Other corpora for Slovenian can be searched using their own specialised concordancers:

Evrokorpus is a collection of parallel bilingual corpora of Slovene translations of EU legislation. The collection is linked to Evroterm – a multilingual terminology base.
TURK, a corpus of tourism-related texts, is a multilingual (Slovenian, Italian, English) corpus that was compiled within the Scientific Research Centre of the University of Primorska.
Nova beseda is a Slovenian corpus that was created by the Institute of Slovenian Language ZRC SAZU.