English YouTube Hate Speech Corpus

Name: English YouTube Hate Speech Corpus
License: https://creativecommons.org/licenses/by-sa/4.0/

Ljubešić, Nikola; Mozetič, Igor; Cinelli, Matteo; Kralj Novak, Petra

dc.contributor.author	Ljubešić, Nikola
dc.contributor.author	Mozetič, Igor
dc.contributor.author	Cinelli, Matteo
dc.contributor.author	Kralj Novak, Petra
dc.date.accessioned	2021-10-28T08:52:20Z
dc.date.available	2021-10-28T08:52:20Z
dc.date.issued	2021-10-14
dc.identifier.uri	http://hdl.handle.net/11356/1454
dc.description	We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. The dataset comprises a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and the other out of context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), both based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection, which is publicly available at https://huggingface.co/IMSyPP/hate_speech_en. The dataset consists of the following fields: Video_ID - YouTube ID of the video under which the comment was posted; Comment_ID - YouTube ID of the comment; Text - text of the comment; Type - type of hate speech; Target - the target of hate speech; Annotator - code of the human annotator.
dc.language.iso	eng
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://imsypp.ijs.si/
dc.subject	hate speech
dc.subject	offensive language
dc.subject	YouTube
dc.subject	social media
dc.title	English YouTube Hate Speech Corpus
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor	European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
sponsor	ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
sponsor	Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info	62414 texts
size.info	146345 items
files.count	3
files.size	32071100
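
Since most comments carry two independent annotations, a natural first step is to group rows by Comment_ID and see how often the two Type labels agree. The following is a minimal sketch, not part of the original record. It assumes the CSV header uses exactly the field names listed in the description, that the files are comma-separated and UTF-8 encoded, and the helper names (load_rows, group_annotations, agreement_rate) are hypothetical.

```python
import csv
from collections import defaultdict

# Field names as listed in the dataset description. That the CSV header
# matches them exactly is an assumption.
FIELDS = ("Video_ID", "Comment_ID", "Text", "Type", "Target", "Annotator")


def load_rows(path):
    """Read one of the CSV files into a list of dicts (assumes UTF-8, comma-separated)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def group_annotations(rows):
    """Group annotation rows by Comment_ID, one list of rows per comment."""
    by_comment = defaultdict(list)
    for row in rows:
        by_comment[row["Comment_ID"]].append(row)
    return by_comment


def agreement_rate(by_comment):
    """Share of doubly annotated comments whose two Type labels are identical.

    Returns None when no comment has exactly two annotations.
    """
    pairs = [anns for anns in by_comment.values() if len(anns) == 2]
    if not pairs:
        return None
    agreeing = sum(1 for first, second in pairs if first["Type"] == second["Type"])
    return agreeing / len(pairs)
```

With the training file downloaded locally, `agreement_rate(group_annotations(load_rows("IMSyPP_EN_YouTube_comments_train.csv")))` would give the raw agreement on Type. Raw agreement does not correct for chance, so it is only a rough first look.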

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: IMSyPP_EN_YouTube_comments_train.csv
Size: 21.15 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - Training Set
MD5: c6c9c413209df08886fc786af53e10f6

Name: IMSyPP_EN_YouTube_comments_evaluation_context.csv
Size: 4.74 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - In-context Evaluation Set
MD5: 6c4ded2e05326120c66d00eb2f95bb89

Name: IMSyPP_EN_YouTube_comments_evaluation_no_context.csv
Size: 4.69 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - No-context Evaluation Set
MD5: 1ef648e80cfa240146f049d2d0adb941
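
The MD5 values listed for the three files can be used to check that a download is complete and unmodified. The following is a minimal sketch using Python's standard hashlib module. The helper names (md5_of_file, verify) and the assumption that the files sit in the current directory under their original names are illustrative only.

```python
import hashlib

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "IMSyPP_EN_YouTube_comments_train.csv": "c6c9c413209df08886fc786af53e10f6",
    "IMSyPP_EN_YouTube_comments_evaluation_context.csv": "6c4ded2e05326120c66d00eb2f95bb89",
    "IMSyPP_EN_YouTube_comments_evaluation_no_context.csv": "1ef648e80cfa240146f049d2d0adb941",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.lower()
```

For example, `verify(name, EXPECTED_MD5[name])` can be called for each file name in `EXPECTED_MD5` after downloading. MD5 is adequate for detecting transfer corruption, but it is not a safeguard against deliberate tampering.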
