dc.contributor.author | Ljubešić, Nikola |
dc.contributor.author | Mozetič, Igor |
dc.contributor.author | Cinelli, Matteo |
dc.contributor.author | Kralj Novak, Petra |
dc.date.accessioned | 2021-10-28T08:52:20Z |
dc.date.available | 2021-10-28T08:52:20Z |
dc.date.issued | 2021-10-14 |
dc.identifier.uri | http://hdl.handle.net/11356/1454 |
dc.description | We present an English YouTube dataset manually annotated for hate speech types and targets. The comments to be annotated were sampled from the English YouTube comments on videos about the Covid-19 pandemic in the period from January 2020 to May 2020. The annotated data comprise a training set with 51,655 comments (IMSyPP_EN_YouTube_comments_train.csv) and two evaluation sets, one annotated in context (IMSyPP_EN_YouTube_comments_evaluation_context.csv) and one annotated out of context (IMSyPP_EN_YouTube_comments_evaluation_no_context.csv), both based on the same 10,759 comments. The dataset was annotated by 10 annotators, with most of the comments (99.9%) annotated by two annotators. It was used to train a classification model for hate speech type detection that is publicly available at the following URL: https://huggingface.co/IMSyPP/hate_speech_en. The dataset consists of the following fields: Video_ID - YouTube ID of the video under which the comment was posted; Comment_ID - YouTube ID of the comment; Text - text of the comment; Type - type of hate speech; Target - the target of hate speech; Annotator - code of the human annotator. |
dc.language.iso | eng |
dc.publisher | Jožef Stefan Institute |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | https://creativecommons.org/licenses/by-sa/4.0/ |
dc.rights.label | PUB |
dc.source.uri | http://imsypp.ijs.si/ |
dc.subject | hate speech |
dc.subject | offensive language |
dc.subject | YouTube |
dc.subject | social media |
dc.title | English YouTube Hate Speech Corpus |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
has.files | yes |
branding | CLARIN.SI data & tools |
contact.person | Nikola Ljubešić nikola.ljubesic@ijs.si Jožef Stefan Institute |
sponsor | European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other |
sponsor | ARRS (Slovenian Research Agency) N6-0099 LiLaH: Linguistic Landscape of Hate Speech nationalFunds |
sponsor | Jožef Stefan Institute CLARIN CLARIN.SI nationalFunds |
size.info | 62414 texts |
size.info | 146345 items |
files.count | 3 |
files.size | 32071100 |
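
The fields listed in the description can be inspected with a few lines of Python. The following is a minimal sketch, not part of the original record: it assumes each CSV file starts with a header row that uses the field names exactly as given in the description (Video_ID, Comment_ID, Text, Type, Target, Annotator), and the helper name `count_labels` is hypothetical.

```python
import csv
from collections import Counter

# Field names as listed in the dataset description. That they appear
# verbatim as the CSV header row is an assumption, so it is checked below.
FIELDS = ["Video_ID", "Comment_ID", "Text", "Type", "Target", "Annotator"]


def count_labels(csv_path, column="Type"):
    """Count how often each value of `column` occurs in one CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in FIELDS if name not in header]
        if missing:
            raise ValueError(f"unexpected header, missing columns: {missing}")
        return Counter(row[column] for row in reader)


# Example call (the path is wherever the file was downloaded to):
# print(count_labels("IMSyPP_EN_YouTube_comments_train.csv").most_common())
```

Since most comments were annotated by two annotators, a Comment_ID may plausibly occur on more than one row; the counts above are per annotation, not per unique comment.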
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name | IMSyPP_EN_YouTube_comments_train.csv |
Size | 21.15 MB |
Format | CSV file |
Description | English Hate Speech YouTube Dataset - Training Set |
MD5 | c6c9c413209df08886fc786af53e10f6 |

Name | IMSyPP_EN_YouTube_comments_evaluation_context.csv |
Size | 4.74 MB |
Format | CSV file |
Description | English Hate Speech YouTube Dataset - In-context Evaluation Set |
MD5 | 6c4ded2e05326120c66d00eb2f95bb89 |

Name | IMSyPP_EN_YouTube_comments_evaluation_no_context.csv |
Size | 4.69 MB |
Format | CSV file |
Description | English Hate Speech YouTube Dataset - No-context Evaluation Set |
MD5 | 1ef648e80cfa240146f049d2d0adb941 |
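
The MD5 values in the file listing can be used to check that a download is complete and unmodified. The sketch below uses only the Python standard library; the checksums are copied from the listing, while the helper names `md5_of_file` and `verify` and the assumption that the files are stored under their original names are illustrative.

```python
import hashlib

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "IMSyPP_EN_YouTube_comments_train.csv": "c6c9c413209df08886fc786af53e10f6",
    "IMSyPP_EN_YouTube_comments_evaluation_context.csv": "6c4ded2e05326120c66d00eb2f95bb89",
    "IMSyPP_EN_YouTube_comments_evaluation_no_context.csv": "1ef648e80cfa240146f049d2d0adb941",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """True if the file at `path` matches the listed checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


# Example call (assumes the file sits in the current directory):
# print(verify("IMSyPP_EN_YouTube_comments_train.csv",
#              "IMSyPP_EN_YouTube_comments_train.csv"))
```

Reading in chunks keeps memory use flat, which matters little for files of this size but costs nothing.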