2026-05-22T05:58:21Z
http://www.clarin.si/repository/oai/request

oai:www.clarin.si:11356/1261
2019-11-25T14:34:07Z
hdl_11356_1023
hdl_11356_1024

Multilingual Culture-Independent Word Analogy Datasets

Ulčar, Matej; Vaik, Kristiina; Lindström, Jessica; Linde, Dace; Dailidėnaitė, Milda; Šumakov, Andrei

analogy; word analogies; multilingual; cross-lingual

The word analogy task evaluates word embeddings based on analogous word pairs (e.g. "Paris - France" should be equivalent to "Rome - Italy", and "son - daughter" should be equivalent to "brother - sister"). The dataset was inspired by Mikolov's analogy test set in English (http://download.tensorflow.org/data/questions-words.txt). It was first written for Slovenian and then partly translated and partly created from scratch for the other languages (Croatian, Finnish, Estonian, Swedish, Latvian, Lithuanian, Russian and English). The analogy dataset is composed of fifteen categories, five semantic and ten syntactic. Each dataset has about 19,000 entries. In addition to the nine monolingual datasets (one for each language), we also composed 72 cross-lingual datasets (one for each ordered language pair), where one half of the entry (one analogy, e.g. "mother-father") is in one language and the other half of the entry (e.g. "sister-brother") is in another language.

2019-11-25
lexicalConceptualResource
http://hdl.handle.net/11356/1261
slv, hrv, eng, fin, est, lav, lit, swe, rus
info:eu-repo/grantAgreement/EC/H2020/825153
https://arxiv.org/abs/1911.10038
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
application/zip
application/zip
text/plain
text/plain; charset=utf-8
downloadable_files_count: 3
Faculty of Computer and Information Science, University of Ljubljana
http://embeddia.eu
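
The analogy task described in this record can be illustrated with a minimal Python sketch. It assumes the common vector-offset scoring (pick the word closest by cosine similarity to b - a + c), which the record itself does not specify. The toy 4-dimensional vectors, the function names `solve_analogy` and `analogy_accuracy`, and the in-memory tuple format for entries are all hypothetical; the actual file layout of the datasets is not stated here. The last lines check that nine languages give 72 ordered language pairs.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" with the vector-offset method (an assumption)."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    # The three query words are excluded from the candidates, as is customary.
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, entries):
    """Fraction of (a, b, c, d) entries whose predicted fourth word equals d."""
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in entries)
    return correct / len(entries)


# Hypothetical toy embeddings, hand-built so that the offsets line up exactly.
toy_emb = {
    "France": [1, 0, 0, 0], "Paris": [1, 0, 1, 0],
    "Italy": [0, 1, 0, 0], "Rome": [0, 1, 1, 0],
    "son": [1, 0, 0, 1], "daughter": [1, 0, 1, 1],
    "brother": [0, 1, 0, 1], "sister": [0, 1, 1, 1],
}

# Entries mirror the examples in the description above.
entries = [
    ("Paris", "France", "Rome", "Italy"),
    ("son", "daughter", "brother", "sister"),
]

accuracy = analogy_accuracy(toy_emb, entries)

# Language codes as listed in the record; ordered pairs of distinct languages.
languages = ["slv", "hrv", "eng", "fin", "est", "lav", "lit", "swe", "rus"]
ordered_pairs = list(permutations(languages, 2))
```

For a cross-lingual entry, the same scoring would apply only if both languages share one embedding space, for example after alignment; that step is outside this sketch.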