
oai:www.clarin.si:11356/1454
2023-03-27T17:01:19Z
hdl_11356_1023
hdl_11356_1024

English YouTube Hate Speech Corpus

Ljubešić, Nikola; Mozetič, Igor; Cinelli, Matteo; Kralj Novak, Petra

hate speech; offensive language; YouTube; social media

We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. The annotated data comprise a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and the other out of context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), each based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at https://huggingface.co/IMSyPP/hate_speech_en.

The dataset consists of the following fields:
Video_ID - YouTube ID of the video under which the comment was posted
Comment_ID - YouTube ID of the comment
Text - text of the comment
Type - type of hate speech
Target - the target of hate speech
Annotator - code of the human annotator

2021-10-14
corpus
http://hdl.handle.net/11356/1454
eng
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
PUB
text/plain; charset=utf-8
text/csv
text/csv
text/csv
downloadable_files_count: 3
Jožef Stefan Institute
http://imsypp.ijs.si/
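
The field list in the description above suggests a simple way to inspect the CSV files. The following is a minimal sketch, not part of the corpus distribution. It assumes that each file is comma-separated UTF-8 with a header row carrying the field names exactly as listed, and that each row holds one annotation (so a comment annotated by two annotators would appear twice). The helper names load_annotations, type_counts and annotations_per_comment are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path

# Column names as listed in the record's description; that they appear
# verbatim as the CSV header row is an assumption.
EXPECTED_FIELDS = ["Video_ID", "Comment_ID", "Text", "Type", "Target", "Annotator"]


def load_annotations(path):
    """Read one corpus CSV file into a list of dicts, one per row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_FIELDS if name not in header]
        if missing:
            raise ValueError(f"missing expected columns: {missing}")
        return list(reader)


def type_counts(rows):
    """Count rows per value of the Type column."""
    return Counter(row["Type"] for row in rows)


def annotations_per_comment(rows):
    """Map each Comment_ID to the number of rows (annotations) it has."""
    return Counter(row["Comment_ID"] for row in rows)
```

With the training file downloaded locally, `rows = load_annotations("IMSyPP_EN_YouTube_comments_train.csv")` followed by `type_counts(rows)` would give the label distribution. If the one-row-per-annotation assumption holds, `annotations_per_comment(rows)` should show a count of two for most comments.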