dc.contributor.author | Škvorc, Tadej |
dc.contributor.author | Gantar, Polona |
dc.contributor.author | Robnik-Šikonja, Marko |
dc.date.accessioned | 2020-09-23T08:27:34Z |
dc.date.available | 2020-09-23T08:27:34Z |
dc.date.issued | 2020-07-27 |
dc.identifier.uri | http://hdl.handle.net/11356/1335 |
dc.description | SloIE is a manually labelled dataset of Slovene idiomatic expressions. It contains 29,400 sentences covering 75 different expressions, each of which can occur with either a literal or an idiomatic meaning, with manual annotations for every token. The expressions were selected from the Slovene Lexical Database (http://hdl.handle.net/11356/1030); only expressions that can occur with both a literal and an idiomatic meaning were included. The sentences were extracted from the Gigafida corpus. For each sentence, the file first contains the text of the sentence, prefixed by #. This is followed by a line of numbers giving the positions of the tokens that belong to the expression. The numbers follow the order of the words in the dictionary form of the expression (e.g., the first number gives the position of the first word of the dictionary form), so they also record the word order for expressions whose word order is flexible. Each token is labelled 'DA' ("yes": a token of the expression used with an idiomatic meaning), 'NE' ("no": a token of the expression used with a literal meaning), or '*' (a token outside the expression). In addition, 'NEJASEN ZGLED' ("unclear example") marks tokens for which the annotators could not determine the meaning from the example sentence. Each token is also tagged with the dictionary form of the expression present in the sentence. Key reference: Škvorc, Tadej, Polona Gantar, and Marko Robnik-Šikonja. "MICE: Mining Idioms with Contextual Embeddings." arXiv preprint arXiv:2008.05759 (2020). |
dc.language.iso | slv |
dc.publisher | Faculty of Computer and Information Science, University of Ljubljana |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825153 |
dc.relation.isreferencedby | https://arxiv.org/abs/2008.05759 |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.label | PUB |
dc.subject | multiword expressions |
dc.subject | manual annotation |
dc.subject | idiomatic expressions |
dc.title | Dataset of Slovene idiomatic expressions SloIE |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Tadej Škvorc Tadej.skvorc@fri.uni-lj.si Faculty of Computer and Information Science, University of Ljubljana |
sponsor | ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds |
sponsor | University of Ljubljana P6-0215 Slovene Language - Basic, Contrastive, and Applied Studies nationalFunds |
sponsor | ARRS (Slovenian Research Agency) J6-8256 New grammar of contemporary standard Slovene: sources and methods nationalFunds |
sponsor | European Union EC/H2020/825153 EMBEDDIA - Cross-Lingual Embeddings for Less-Represented Languages in European News Media euFunds info:eu-repo/grantAgreement/EC/H2020/825153 |
size.info | 29400 sentences |
size.info | 695636 tokens |
files.count | 1 |
files.size | 4425132 |
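
The file layout given in dc.description (a sentence line prefixed by #, then a line of position numbers, then per-token annotations) could be read with something like the sketch below. It is only an illustration under several assumptions that the record does not confirm: that token annotations occupy one line per token, that the position numbers are separated by non-digit characters, and that no token line itself begins with #. The names `Record` and `read_records` are hypothetical, and token lines are kept as raw strings because the record does not specify their column layout.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Record:
    text: str                 # sentence text without the leading '#'
    positions: List[int]      # expression token positions, in dictionary-form order
    token_lines: List[str] = field(default_factory=list)  # raw annotation lines


def read_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group lines into one Record per sentence (assumed layout, see lead-in)."""
    current: Optional[Record] = None
    expect_positions = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines carry no information
        if line.startswith("#") and not expect_positions:
            # A '#' line starts a new sentence; emit the previous one first.
            if current is not None:
                yield current
            current = Record(text=line[1:].strip(), positions=[])
            expect_positions = True
        elif expect_positions:
            # The line right after the sentence holds the position numbers.
            current.positions = [int(n) for n in re.findall(r"\d+", line)]
            expect_positions = False
        elif current is not None:
            # Everything else is kept verbatim as a per-token annotation line.
            current.token_lines.append(line)
    if current is not None:
        yield current
```

A caller would typically pass an open text file, for example `read_records(open(path, encoding="utf-8"))`, where `path` points to the annotation file extracted from the archive; the UTF-8 encoding is likewise an assumption.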
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

- Name: SloIE.zip
- Size: 4.22 MB
- Format: application/zip
- Description: SloIE dataset
- MD5: f7534b47a852631641d8f3c496ef3a10
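
A downloaded copy of SloIE.zip can be checked against the MD5 value and the byte size (files.size) listed in this record. The following is a minimal sketch using only the Python standard library; the function names and the local path passed to them are hypothetical.

```python
import hashlib
import os

# Values copied from this record (MD5 and files.size).
EXPECTED_MD5 = "f7534b47a852631641d8f3c496ef3a10"
EXPECTED_SIZE = 4425132  # bytes


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_intact(path: str,
                 expected_md5: str = EXPECTED_MD5,
                 expected_size: int = EXPECTED_SIZE) -> bool:
    """True if both the size and the MD5 digest match the expected values."""
    if os.path.getsize(path) != expected_size:
        return False
    return md5_of_file(path) == expected_md5.lower()
```

For example, `looks_intact("SloIE.zip")` would return `True` only for a file whose size and digest both match the values above. MD5 is adequate for detecting a corrupted download but is not a safeguard against deliberate tampering.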