dc.contributor.author Škvorc, Tadej
dc.contributor.author Gantar, Polona
dc.contributor.author Robnik-Šikonja, Marko
dc.date.accessioned 2020-09-23T08:27:34Z
dc.date.available 2020-09-23T08:27:34Z
dc.date.issued 2020-07-27
dc.identifier.uri http://hdl.handle.net/11356/1335
dc.description SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences covering 75 different expressions, each of which can occur with either a literal or an idiomatic meaning, and every token is manually annotated. The expressions were selected from the Slovene Lexical Database (http://hdl.handle.net/11356/1030); only expressions that can occur with both a literal and an idiomatic meaning were included. The sentences were extracted from the Gigafida corpus. For each sentence, the file first contains the text of the sentence prefixed by #. This is followed by a line of numbers giving the positions of the tokens that belong to the expression. For expressions with flexible word order, the numbers also encode the word order: they are listed in the order of the expression's dictionary form (e.g., the first number is the position at which the first word of the dictionary form occurs). Each token is labelled with 'DA' ("yes") for tokens in an expression used with an idiomatic meaning, 'NE' ("no") for tokens in an expression used with a literal meaning, or '*' for tokens outside the expression. The label 'NEJASEN ZGLED' ("unclear example") marks tokens for which the annotators could not determine the meaning from the example sentence. Each token is also tagged with the dictionary form of the expression that is present in the sentence. Key reference: Škvorc, Tadej, Polona Gantar, and Marko Robnik-Šikonja. "MICE: Mining Idioms with Contextual Embeddings." arXiv preprint arXiv:2008.05759 (2020).
dc.language.iso slv
dc.publisher Faculty of Computer and Information Science, University of Ljubljana
dc.relation info:eu-repo/grantAgreement/EC/H2020/825153
dc.relation.isreferencedby https://arxiv.org/abs/2008.05759
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.label PUB
dc.subject multiword expressions
dc.subject manual annotation
dc.subject idiomatic expressions
dc.title Dataset of Slovene idiomatic expressions SloIE
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Tadej Skvorc Tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana
sponsor ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
sponsor University of Ljubljana P6-0215 Slovene Language - Basic, Contrastive, and Applied Studies nationalFunds
sponsor ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds
sponsor European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153
size.info 29400 sentences
size.info 695636 tokens
files.count 1
files.size 4425132
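
The file layout given in dc.description (a sentence line prefixed by #, a line of position numbers, then per-token labels) can be read with a short parser. The following is a minimal sketch, not the authors' tooling. The record does not specify how token lines are laid out, so the sketch assumes one token per line with tab-separated columns in the order token, label, dictionary form of the expression; the names Sentence and parse_sloie are hypothetical. Check these assumptions against SloIE.txt before relying on the code.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple

# Labels listed in the dataset description.
VALID_LABELS = {"DA", "NE", "*", "NEJASEN ZGLED"}


@dataclass
class Sentence:
    text: str                                   # sentence text without the leading '#'
    positions: List[int] = field(default_factory=list)
    # (token, label, expression) triples; the column order is an ASSUMPTION.
    tokens: List[Tuple[str, str, str]] = field(default_factory=list)


def parse_sloie(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per '#'-prefixed block.

    ASSUMPTION: token lines are tab-separated as token<TAB>label<TAB>expression.
    Limitation: a token line that itself starts with '#' would be misread
    as the start of a new sentence.
    """
    current = None
    expect_positions = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue                            # skip blank separator lines
        if line.startswith("#"):
            if current is not None:
                yield current
            current = Sentence(text=line[1:].strip())
            expect_positions = True
        elif current is None:
            continue                            # ignore anything before the first sentence
        elif expect_positions:
            # Positions follow the order of the expression's dictionary form.
            current.positions = [int(p) for p in line.split()]
            expect_positions = False
        else:
            parts = line.split("\t")
            if len(parts) < 2:
                raise ValueError(f"unexpected token line: {line!r}")
            token, label = parts[0], parts[1]
            expression = parts[2] if len(parts) > 2 else ""
            if label not in VALID_LABELS:
                raise ValueError(f"unknown label {label!r} in line {line!r}")
            current.tokens.append((token, label, expression))
    if current is not None:
        yield current
```

With the archive unpacked, `parse_sloie(open("SloIE.txt", encoding="utf-8"))` would iterate over the sentences; the UTF-8 encoding is likewise an assumption. Splitting on tabs rather than on any whitespace matters here because the label 'NEJASEN ZGLED' contains a space.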


 Files in this item

Name SloIE.zip
Size 4.22 MB
Format application/zip
Description SloIE dataset
MD5 f7534b47a852631641d8f3c496ef3a10
Contents SloIE.txt (21 MB)
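
The MD5 checksum and byte size listed in this record can be used to check a downloaded copy of SloIE.zip. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and verify_download and the local file path are hypothetical, while the expected values are copied from the record above.

```python
import hashlib
import os

# Values copied from this record (MD5 and files.size).
EXPECTED_MD5 = "f7534b47a852631641d8f3c496ef3a10"
EXPECTED_SIZE = 4425132  # bytes


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str,
                    expected_md5: str = EXPECTED_MD5,
                    expected_size: int = EXPECTED_SIZE) -> bool:
    """True if both the file size and the MD5 digest match the expected values."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("SloIE.zip")` would return True for an intact copy saved under that (assumed) name. MD5 is adequate for detecting transfer corruption, which is its purpose here; it does not protect against deliberate tampering.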
