The Orange workflow for observing collocation clusters ColEmbed 1.0

Name: The Orange workflow for observing collocation clusters ColEmbed 1.0
License: https://opensource.org/licenses/Apache-2.0

Kosem, Iztok; Čibej, Jaka; Ljubešić, Nikola; Krek, Simon; Gantar, Polona; Arhar Holdt, Špela; Logar, Nataša; Laskowski, Cyprian; Klemenc, Bojan; Dobrovoljc, Kaja; Gorjanc, Vojko; Pori, Eva

dc.contributor.author	Kosem, Iztok
dc.contributor.author	Čibej, Jaka
dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Krek, Simon
dc.contributor.author	Gantar, Polona
dc.contributor.author	Arhar Holdt, Špela
dc.contributor.author	Logar, Nataša
dc.contributor.author	Laskowski, Cyprian
dc.contributor.author	Klemenc, Bojan
dc.contributor.author	Dobrovoljc, Kaja
dc.contributor.author	Gorjanc, Vojko
dc.contributor.author	Pori, Eva
dc.date.accessioned	2021-05-07T13:17:33Z
dc.date.available	2021-05-07T13:17:33Z
dc.date.issued	2020-10-26
dc.identifier.uri	http://hdl.handle.net/11356/1425
dc.description	The Orange Workflow for Observing Collocation Clusters ColEmbed 1.0. ColEmbed is a workflow (.OWS file) for Orange Data Mining (open-source machine learning and data visualization software: https://orangedatamining.com/) that allows the user to observe clusters of collocation candidates extracted from corpora. The workflow consists of a series of data filters, embedding processors, and visualizers. As input, the workflow takes a tab-separated file (.TSV/.TAB) with data on collocations extracted from a corpus, along with their relative frequencies by year of publication and other optional values (such as information on temporal trends). The user selects the features that the workflow then uses to cluster the collocation candidates, together with the embeddings generated from the selected lemmas. Either one lemma or both lemmas can be selected, depending on the clustering criteria: for instance, to cluster adjective+noun candidates by the similarity of their noun components, only the second lemma is selected for embedding generation. The resulting embedding clusters can be visualized and processed further (e.g. by finding the closest neighbors of a reference collocation). The workflow is described in more detail in the accompanying README file. The entry also contains three .TAB files that can be used to test the workflow. The files contain collocation candidates (along with their relative frequencies per year of publication and four measures describing their temporal trends; see http://hdl.handle.net/11356/1424 for more details) extracted from the Gigafida 2.0 Corpus of Written Slovene (https://viri.cjvt.si/gigafida/) with three different syntactic structures (as defined in http://hdl.handle.net/11356/1415): 1) p0-s0 (adjective + noun, e.g. rezervni sklad), 2) s0-s2 (noun + noun in the genitive case, e.g. ukinitev lastnine), and 3) gg-s4 (verb + noun in the accusative case, e.g. pripraviti besedilo). Only collocation candidates with an absolute frequency of 15 or more were extracted. The ColEmbed workflow requires the Text Mining add-on for Orange. For installation instructions, a more detailed description of the phases of the workflow, and the measures used to observe collocation trends, please consult the README file.
dc.publisher	Centre for Language Resources and Technologies, University of Ljubljana
dc.rights	Apache License 2.0
dc.rights.uri	https://opensource.org/licenses/Apache-2.0
dc.rights.label	PUB
dc.source.uri	https://www.cjvt.si/kolos/
dc.subject	collocations
dc.subject	clustering
dc.subject	word embeddings
dc.subject	temporal trends
dc.title	The Orange workflow for observing collocation clusters ColEmbed 1.0
dc.type	toolService
metashare.ResourceInfo#ContentInfo.detailedType	tool
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	false
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Iztok Kosem iztok.kosem@cjvt.si Centre for Language Resources and Technologies, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) J6-8255 Collocations as a basis for language description: semantic and temporal perspectives nationalFunds
sponsor	ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
files.count	1
files.size	90517674
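
The description mentions finding the closest neighbors of a reference collocation in embedding space. The following is a minimal sketch of that idea outside Orange, not the workflow's own implementation: it assumes cosine similarity as the measure (the record does not say which measure the workflow uses), and the vectors and the `candidate_*` names are hypothetical toy values, not data from the .TAB files. Only the reference collocation "rezervni sklad" is taken from the description.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_neighbors(reference, embeddings, k=3):
    """Return the k items most similar to `reference`, best match first."""
    ref_vec = embeddings[reference]
    scored = [
        (name, cosine_similarity(ref_vec, vec))
        for name, vec in embeddings.items()
        if name != reference
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional toy embeddings; real embeddings have many more
# dimensions and would come from the workflow's embedding step.
toy_embeddings = {
    "rezervni sklad": [0.9, 0.1, 0.0],
    "candidate_a": [0.8, 0.2, 0.1],
    "candidate_b": [0.1, 0.9, 0.2],
    "candidate_c": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for name, score in closest_neighbors("rezervni sklad", toy_embeddings, k=2):
        print(f"{name}\t{score:.3f}")
```

Sorting by descending similarity keeps the sketch short; for the full candidate lists in the test files, an indexed nearest-neighbor search would likely be more practical.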

Files in this item

This item is publicly available and licensed under: Apache License 2.0

Name: colembed_1.0.zip
Size: 86.32 MB
Format: application/zip
Description: ColEmbed 1.0
MD5: 38f6e77ff928382770be421f01b279f4
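
After downloading, the archive can be checked against the MD5 value listed above. The following is a minimal sketch using Python's standard `hashlib` module; the local path `colembed_1.0.zip` is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# MD5 value as listed in this record.
EXPECTED_MD5 = "38f6e77ff928382770be421f01b279f4"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("colembed_1.0.zip")  # assumed download location
    if archive.exists():
        print("checksum OK" if matches_record(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks avoids loading the whole archive (about 86 MB) into memory at once.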

File Preview

colembed_1.0
- colembed_gf2_tab
  - coltrend_gf2_34_p0-s0.tab (130 MB)
  - coltrend_gf2_23_gg-s4.tab (49 MB)
  - coltrend_gf2_53_s0-s2.tab (88 MB)
- colembed_1.0_workflow.ows (299 kB)
- 00README.txt (3 kB)
