Dataset of Slovene idiomatic expressions SloIE

Name: Dataset of Slovene idiomatic expressions SloIE
License: https://creativecommons.org/licenses/by-nc-sa/4.0/

Škvorc, Tadej; Gantar, Polona; Robnik-Šikonja, Marko


dc.contributor.author	Škvorc, Tadej
dc.contributor.author	Gantar, Polona
dc.contributor.author	Robnik-Šikonja, Marko
dc.date.accessioned	2020-09-23T08:27:34Z
dc.date.available	2020-09-23T08:27:34Z
dc.date.issued	2020-07-27
dc.identifier.uri	http://hdl.handle.net/11356/1335
dc.description	SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences covering 75 different expressions, with manual annotations for each token. The expressions were selected from the Slovene Lexical Database (http://hdl.handle.net/11356/1030); only expressions that can occur with both a literal and an idiomatic meaning were included. The sentences were extracted from the Gigafida corpus. For each sentence, the file first contains the text of the sentence prefixed by #. This is followed by a line of numbers giving the positions of the tokens that belong to the expression. The numbers are ordered according to the dictionary form of the expression (e.g., the first number gives the position at which the first word of the dictionary form occurs), so they also record the word order for expressions whose word order is flexible. Each token is labelled with 'DA' (Slovene for 'yes'), indicating a token in an expression used with an idiomatic meaning, 'NE' ('no'), indicating a token in an expression used with a literal meaning, or '*', indicating a token outside the expression. Additionally, 'NEJASEN ZGLED' ('unclear example') marks tokens for which the annotators could not determine the meaning from the example sentence. Each token is also tagged with the dictionary form of the expression that is present in the sentence. Key reference: Škvorc, Tadej, Polona Gantar, and Marko Robnik-Šikonja. "MICE: Mining Idioms with Contextual Embeddings." arXiv preprint arXiv:2008.05759 (2020).
dc.language.iso	slv
dc.publisher	Faculty of Computer and Information Science, University of Ljubljana
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby	https://arxiv.org/abs/2008.05759
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label	PUB
dc.subject	multiword expressions
dc.subject	manual annotation
dc.subject	idiomatic expressions
dc.title	Dataset of Slovene idiomatic expressions SloIE
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Tadej Skvorc Tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor	University of Ljubljana P6-0215 Slovene Language - Basic, Contrastive, and Applied Studies nationalFunds
sponsor	ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor	European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info	29400 sentences
size.info	695636 tokens
files.count	1
files.size	4425132
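
The file layout described in dc.description can be read with a small parser. The following Python sketch is illustrative only and is not part of the dataset distribution. It rests on several assumptions that the record does not confirm: that each token sits on its own line, that the token, its label, and the dictionary form are separated by tabs in that order, that the position numbers are separated by whitespace, and that a sentence header is a line starting with # that contains no tab. The names Token, Sentence, and parse_sloie are hypothetical. Check the assumptions against the actual file before relying on the output.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional

# Labels named in the dataset description.
IDIOMATIC, LITERAL, OUTSIDE, UNCLEAR = "DA", "NE", "*", "NEJASEN ZGLED"


@dataclass
class Token:
    form: str        # surface form of the token
    label: str       # one of the labels above
    expression: str  # dictionary form of the expression in the sentence


@dataclass
class Sentence:
    text: str
    positions: List[int] = field(default_factory=list)
    tokens: List[Token] = field(default_factory=list)


def parse_sloie(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per '#'-prefixed header line.

    ASSUMED layout (not confirmed by the record): header line, then one
    line of whitespace-separated integers, then one tab-separated line
    per token (form, label, dictionary form).
    """
    current: Optional[Sentence] = None
    expect_positions = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#") and "\t" not in line:
            # A new sentence starts; emit the previous one first.
            if current is not None:
                yield current
            current = Sentence(text=line[1:].strip())
            expect_positions = True
        elif current is None:
            continue  # skip anything before the first header
        elif expect_positions:
            current.positions = [int(p) for p in line.split()]
            expect_positions = False
        else:
            # Pad so that short lines do not raise on unpacking.
            form, label, expression = (line.split("\t") + ["", ""])[:3]
            current.tokens.append(Token(form, label, expression))
    if current is not None:
        yield current
```

Under the same assumptions, the sentences in which the expression is used idiomatically can be collected by passing an open file handle (opened with encoding="utf-8") to parse_sloie and keeping the sentences where any token has the label DA. Whether the position numbers are zero-based or one-based is not stated in the record, so the sketch stores them unchanged.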