dc.contributor.author Ljubešić, Nikola
dc.contributor.author Mozetič, Igor
dc.contributor.author Cinelli, Matteo
dc.contributor.author Kralj Novak, Petra
dc.date.accessioned 2021-10-28T08:52:20Z
dc.date.available 2021-10-28T08:52:20Z
dc.date.issued 2021-10-14
dc.identifier.uri http://hdl.handle.net/11356/1454
dc.description We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from English YouTube comments on videos about the Covid-19 pandemic, posted between January 2020 and May 2020. The annotated data comprise a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in-context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and the other out-of-context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), both based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most (99.9%) of the comments annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_en. The dataset consists of the following fields:
  Video_ID - YouTube ID of the video under which the comment was posted
  Comment_ID - YouTube ID of the comment
  Text - text of the comment
  Type - type of hate speech
  Target - the target of hate speech
  Annotator - code of the human annotator
dc.language.iso eng
dc.publisher Jožef Stefan Institute
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label PUB
dc.source.uri http://imsypp.ijs.si/
dc.subject hate speech
dc.subject offensive language
dc.subject YouTube
dc.subject social media
dc.title English YouTube Hate Speech Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
has.files yes
branding CLARIN.SI data & tools
contact.person Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute
sponsor European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
sponsor ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds
sponsor Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds
size.info 62414 texts
size.info 146345 items
files.count 3
files.size 32071100
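
The field list in dc.description suggests a simple way to inspect the files once downloaded. The following is a minimal sketch, not the authors' tooling: it assumes the CSV header uses exactly the field names listed above, that the files are comma-delimited, and that they are UTF-8 encoded (none of which this record states). The function name `summarize` and the sample rows, including their label values and annotator codes, are made up for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Field names as listed in the record's description. Whether the CSV header
# spells them exactly this way is an assumption.
FIELDS = ["Video_ID", "Comment_ID", "Text", "Type", "Target", "Annotator"]


def summarize(csv_file):
    """Return (Type label counts, distinct comments, comments with >= 2 annotators)."""
    reader = csv.DictReader(csv_file)
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")

    type_counts = Counter()
    annotators_per_comment = defaultdict(set)
    for row in reader:
        # One row is one annotation of one comment by one annotator.
        type_counts[row["Type"]] += 1
        annotators_per_comment[row["Comment_ID"]].add(row["Annotator"])

    multi = sum(1 for a in annotators_per_comment.values() if len(a) >= 2)
    return type_counts, len(annotators_per_comment), multi


# Tiny invented sample; the label values and annotator codes are placeholders,
# not values taken from the corpus.
SAMPLE = """Video_ID,Comment_ID,Text,Type,Target,Annotator
v1,c1,"first comment, with a comma",label_a,target_x,A01
v1,c1,"first comment, with a comma",label_b,target_x,A02
v2,c2,second comment,label_a,target_y,A01
"""

if __name__ == "__main__":
    counts, n_comments, n_multi = summarize(io.StringIO(SAMPLE))
    print(counts, n_comments, n_multi)
```

To run it on a downloaded file, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object to `summarize`. Counting annotators per Comment_ID is one way to check the statement that most comments were annotated by two annotators.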


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: IMSyPP_EN_YouTube_comments_train.csv
Size: 21.15 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - Training Set
MD5: c6c9c413209df08886fc786af53e10f6
Name: IMSyPP_EN_YouTube_comments_evaluation_context.csv
Size: 4.74 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - In-context Evaluation Set
MD5: 6c4ded2e05326120c66d00eb2f95bb89
Name: IMSyPP_EN_YouTube_comments_evaluation_no_context.csv
Size: 4.69 MB
Format: CSV file
Description: English Hate Speech YouTube Dataset - No-context Evaluation Set
MD5: 1ef648e80cfa240146f049d2d0adb941
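
The MD5 values listed for the three files can be used to check that a download is complete and uncorrupted. The sketch below uses only Python's standard `hashlib`; the file names and checksums are copied from this record, while the helper names `md5_of_file` and `verify` and the assumption that the files sit in the current directory under their original names are illustrative.

```python
import hashlib

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "IMSyPP_EN_YouTube_comments_train.csv": "c6c9c413209df08886fc786af53e10f6",
    "IMSyPP_EN_YouTube_comments_evaluation_context.csv": "6c4ded2e05326120c66d00eb2f95bb89",
    "IMSyPP_EN_YouTube_comments_evaluation_no_context.csv": "1ef648e80cfa240146f049d2d0adb941",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Return True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


if __name__ == "__main__":
    # Assumes the files were saved in the current directory under their
    # original names; adjust the paths otherwise.
    for name in EXPECTED_MD5:
        try:
            status = "OK" if verify(name, name) else "MISMATCH"
        except FileNotFoundError:
            status = "not found"
        print(f"{name}: {status}")
```

MD5 is adequate here for detecting accidental corruption, which is what the listed checksums are for; it should not be relied on as protection against deliberate tampering.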