# CoSimLex: Graded Effects of Context in Similarity Perception

CoSimLex is a dataset designed to study the effect of textual context on the perception of similarity between two words. The dataset provides pairs in English and three less-resourced European languages:

- 340 English pairs: cosimlex_en.tsv
- 112 Croatian pairs: cosimlex_hr.tsv
- 111 Slovene pairs: cosimlex_sl.tsv
- 24 Finnish pairs: cosimlex_fi.tsv

To produce the contextual similarity scores, human annotators were instructed to score the similarity between the two words within the text in which they appear. Each pair is scored within two different contexts (short texts containing the two words), which provides a way to compare the impact of different contexts on the same pair of words. The pairs are a subset of the well-known SimLex-999 dataset (Hill et al., 2015). The dataset was used to evaluate SemEval-2020 Task 3: Graded Word Similarity in Context.

For more detailed information:

- LREC 2020 paper -> https://www.aclweb.org/anthology/2020.lrec-1.720
- SemEval-2020 Task 3 -> https://competitions.codalab.org/competitions/20905
- A task description paper will be published in the Proceedings of the 14th International Workshop on Semantic Evaluation.

Hill, F., Reichart, R., & Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665-695.

## Format:

Tab-separated values in the datasets:

- **word1** -> First word in the pair. Uninflected form.
- **word2** -> Second word in the pair. Uninflected form.
- **context1** -> First context containing the pair of words. The target words are marked with `<strong></strong>` labels.
- **context2** -> Second context containing the pair of words. The target words are marked with `<strong></strong>` labels.
- **sim1** -> Mean of the similarity scores within the first context.
- **sim2** -> Mean of the similarity scores within the second context.
- **stdev1** -> Standard deviation of the scores within the first context.
- **stdev2** -> Standard deviation of the scores within the second context.
- **pvalue** -> P-value calculated using the Mann-Whitney U test.
- **word1_context1** -> Inflected form of the first word as it appears in the first context.
- **word2_context1** -> Inflected form of the second word as it appears in the first context.
- **word1_context2** -> Inflected form of the first word as it appears in the second context.
- **word2_context2** -> Inflected form of the second word as it appears in the second context.

## Referencing:

These resources are freely available for education, research and other non-commercial purposes. Please don't forget to reference our work:

    @inproceedings{armendariz-etal-2020-semeval,
        title = "{SemEval-2020} {T}ask 3: Graded Word Similarity in Context ({GWSC})",
        author = "Armendariz, Carlos S. and Purver, Matthew and Pollak, Senja and Ljube{\v{s}}i{\'{c}}, Nikola and Ul{\v{c}}ar, Matej and Robnik-{\v{S}}ikonja, Marko and Vuli{\'{c}}, Ivan and Pilehvar, Mohammad Taher",
        booktitle = "Proceedings of the 14th International Workshop on Semantic Evaluation",
        year = "2020",
        address = "Online"
    }

    @InProceedings{armendariz-EtAl:2020:LREC,
        author = {Armendariz, Carlos S. and Purver, Matthew and Ulčar, Matej and Pollak, Senja and Ljubešić, Nikola and Granroth-Wilding, Mark},
        title = {{CoSimLex}: A Resource for Evaluating Graded Word Similarity in Context},
        booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
        month = {May},
        year = {2020},
        address = {Marseille, France},
        publisher = {European Language Resources Association},
        pages = {5878--5886},
        url = {https://www.aclweb.org/anthology/2020.lrec-1.720}
    }

## Contact:

Carlos S. Armendariz

- c.santosarmendariz@qmul.ac.uk
- carlos@santosarmendariz.com
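
## Loading example:

The column layout described under Format can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader. It assumes that each file starts with a header row using the column names listed above and that the files are UTF-8 encoded. The function name `read_cosimlex`, the derived `change` field and the sample row are hypothetical and made up for illustration; they do not come from the dataset.

```python
import csv
import io

# Numeric columns as named in the Format section. A header row carrying
# these names is an assumption of this sketch.
NUMERIC_COLUMNS = ("sim1", "sim2", "stdev1", "stdev2", "pvalue")


def read_cosimlex(handle):
    """Parse tab-separated CoSimLex rows into dicts with numeric fields as floats."""
    # QUOTE_NONE keeps quotation marks inside the context texts untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        # Hypothetical derived field: shift in mean similarity from the
        # first context to the second.
        row["change"] = row["sim2"] - row["sim1"]
        rows.append(row)
    return rows


# Made-up header and row for illustration only; not taken from the dataset.
_HEADER = ["word1", "word2", "context1", "context2", "sim1", "sim2",
           "stdev1", "stdev2", "pvalue", "word1_context1", "word2_context1",
           "word1_context2", "word2_context2"]
_ROW = ["wordA", "wordB", "first text with wordA and wordB",
        "second text with wordA and wordB", "2.5", "6.0", "1.0", "1.5",
        "0.01", "wordA", "wordB", "wordA", "wordB"]
SAMPLE = "\t".join(_HEADER) + "\n" + "\t".join(_ROW) + "\n"

rows = read_cosimlex(io.StringIO(SAMPLE))
for r in rows:
    print(f"{r['word1']}/{r['word2']}: change in similarity {r['change']:+.2f}")
```

To read one of the real files under the same assumptions, pass an open file handle instead of the in-memory sample, for example `open("cosimlex_en.tsv", encoding="utf-8", newline="")`.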