Slovenian Twitter hate speech dataset IMSyPP-sl

Name: Slovenian Twitter hate speech dataset IMSyPP-sl
License: https://creativecommons.org/licenses/by-sa/4.0/

Kralj Novak, Petra; Mozetič, Igor; Ljubešić, Nikola

dc.contributor.author	Kralj Novak, Petra
dc.contributor.author	Mozetič, Igor
dc.contributor.author	Ljubešić, Nikola
dc.date.accessioned	2021-02-25T16:52:02Z
dc.date.available	2021-02-25T16:52:02Z
dc.date.issued	2021-02-17
dc.identifier.uri	http://hdl.handle.net/11356/1398
dc.description	A hand-labeled training set (50,000 tweets labeled twice) and evaluation set (10,000 tweets labeled twice) for hate speech on Slovenian Twitter. The data files contain tweet IDs, hate speech type, hate speech target, and annotator ID. To obtain the full text of the dataset, please contact the first author.

Hate speech type:
1. Appropriate - has no target
2. Inappropriate (contains terms that are obscene, vulgar; but the text is not directed at any person specifically) - has no target
3. Offensive (including offensive generalization, contempt, dehumanization, indirect offensive remarks)
4. Violent (author threatens, indulges, desires, or calls for physical violence against a target; it also includes calling for, denying, or glorifying war crimes and crimes against humanity)

Hate speech target:
1. Racism (intolerance based on nationality, ethnicity, language, towards foreigners; and based on race, skin color)
2. Migrants (intolerance of refugees or migrants, offensive generalization, call for their exclusion, restriction of rights, non-acceptance, denial of assistance…)
3. Islamophobia (intolerance towards Muslims)
4. Antisemitism (intolerance of Jews; also includes conspiracy theories, Holocaust denial or glorification, offensive stereotypes…)
5. Religion (other than above)
6. Homophobia (intolerance based on sexual orientation and/or identity, calls for restrictions on the rights of LGBTQ persons)
7. Sexism (offensive gender-based generalization, misogynistic insults, unjustified gender discrimination)
8. Ideology (intolerance based on political affiliation, political belief, ideology… e.g. “communists”, “leftists”, “home defenders”, “socialists”, “activists for…”)
9. Media (journalists and media, also includes allegations of unprofessional reporting, false news, bias)
10. Politics (intolerance towards individual politicians, authorities, system, political parties)
11. Individual (intolerance toward any other individual due to individual characteristics; like commentator, neighbor, acquaintance)
12. Other (intolerance towards members of other groups due to belonging to this group; write in the blank column on the right which group it is)

Training dataset
The training set is sampled from data collected between December 2017 and February 2020. The sampling was intentionally biased to contain as much hate speech as possible. A simple model was used to flag potential hate speech content; in addition, filtering by users and by tweet length (number of characters) was applied. 50,000 tweets were selected for annotation.

Evaluation dataset
The evaluation set is sampled from data collected between February 2020 and August 2020. Contrary to the training set, the evaluation set is an unbiased random sample. Since the evaluation set is from a later period compared to the training set, the possibility of data linkage is minimized. Furthermore, the estimates of model performance made on the evaluation set are realistic, or even pessimistic, since the evaluation set is characterized by a new topic: Covid-19. 10,000 tweets were selected for the evaluation set.

Annotation results
Each tweet was annotated twice: in 90% of the cases by two different annotators and in 10% of the cases by the same annotator. Special attention was devoted to evening out the overlap between annotators to get agreement estimates on equally sized sets. Ten annotators were engaged for the annotation campaign. They were given annotation guidelines, a training session, and a test on a small set to evaluate their understanding of the task and their commitment before starting the annotation procedure. Annotator agreement in terms of Krippendorff's Alpha is around 0.6. Annotation agreement scores are detailed in the accompanying report files for each dataset separately. The annotation process lasted four months, and it required about 1,200 person-hours for the ten annotators to complete the task.
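
The description reports agreement as Krippendorff's Alpha over tweets that each carry two labels. The following is a minimal sketch of the nominal variant of that statistic for exactly two codings per item. Whether the authors used the nominal variant is an assumption, the function name `krippendorff_alpha_nominal` is hypothetical, and the input pairs are toy values, not data from the corpus. The accompanying report files remain the authoritative source for the actual scores.

```python
from collections import Counter


def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha (nominal metric) for items coded exactly twice.

    `pairs` is an iterable of (first_label, second_label) tuples, one per item.
    This is an illustrative sketch, not the authors' evaluation script.
    """
    pairs = list(pairs)
    n = 2 * len(pairs)  # total number of pairable label values
    if n < 4:
        raise ValueError("need at least two doubly coded items")

    value_counts = Counter()
    disagreeing = 0  # ordered label pairs (a, b) and (b, a) with a != b
    for first, second in pairs:
        value_counts[first] += 1
        value_counts[second] += 1
        if first != second:
            disagreeing += 2

    # Observed disagreement: share of ordered within-item pairs that differ.
    observed = disagreeing / n
    # Expected disagreement: chance that two values drawn without
    # replacement from the pooled labels differ.
    same_label_pairs = sum(count * count for count in value_counts.values())
    expected = (n * n - same_label_pairs) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Toy example: four tweets, labels taken from the 0..3 type codes.
toy_pairs = [(0, 0), (0, 0), (2, 2), (2, 0)]
alpha = krippendorff_alpha_nominal(toy_pairs)
```

Because every item here has exactly two codings, the coincidence matrix reduces to counting each pair in both orders, which keeps the sketch short. Items with a missing second label would need to be dropped before calling the function.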
dc.language.iso	slv
dc.publisher	Jožef Stefan Institute
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	https://creativecommons.org/licenses/by-sa/4.0/
dc.rights.label	PUB
dc.source.uri	http://imsypp.ijs.si/
dc.subject	hate speech
dc.subject	Twitter
dc.subject	offensive language
dc.subject	inappropriate language
dc.subject	violent language
dc.subject	manual annotation
dc.title	Slovenian Twitter hate speech dataset IMSyPP-sl
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
has.files	yes
branding	CLARIN.SI data & tools
contact.person	Petra Kralj Novak Petra.Kralj.Novak@ijs.si Jožef Stefan Institute
sponsor	European Union’s Rights, Equality and Citizenship Programme 875263 IMSyPP - Innovative Monitoring Systems and Prevention Policies of Online Hate Speech Other
sponsor	ARRS (Slovenian Research Agency) P2-103 Knowledge Technologies nationalFunds
sponsor	ARRS (Slovenian Research Agency) P6-0411 Language Resources and Technologies for Slovene nationalFunds
size.info	120000 items
files.count	4
files.size	5442842

Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: IMSyPP_SI_anotacije_training-clarin.csv
Size: 4.27 MB
Format: CSV file
Description: Training dataset: Slovenian Twitter sample labeled for hate speech type and target.
MD5: 0fe160a1e9ab82a723ba9b2ffee90f78

Name: IMSyPP_SI_anotacije_evaluation-report.txt
Size: 27.86 KB
Format: Text file
Description: Evaluation dataset: annotation agreement scores for the evaluation dataset.
MD5: a84c9b7a6c5cf99d0402160ec0d70273

File Preview

Agreement report for both annotation questions: hate speech type (vrsta) and target (tarča) for the data in the file IMSyPP_SI_anotacije_evaluation-clarin.csv.

Hate speech types (vrsta):
0 appropriate (ni sporni govor)
1 inappropriate (nespodobni govor)
2 offensive (žalitev)
3 violent (nasilje)

Hate speech targets (tarča):
1 racism (ksenofobija in rasizem)
2 migrants (begunci/migranti)
3 islamophobia (islamofobija)
4 antisemitism (antisemitizem)
5 religion (druge religije)
6 homophobia (homofobija)
7 sexism (seksizem)
8 ideology (ideologija)
9 media (novinarji in mediji)
10 politics (politika/-i)
11 individual (posameznik)
12 other (drugo)

Annotated instances

a0 2000/2000
a1 2000/2000
a2 2000/2000
a3 2000/2000
a4 2000/2000
a5 2000/2000
a6 2000/2000
a7 2000/2000
a8 2000/2000
a9 2000/2000

-----------------
-----OVERALL-----
-----------------
Annotated for  vrsta : 20000
0 ni sporni govor     13273
1 nespodobni govor      285
2 žalitev . . .
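
The code tables in the report preview above can be captured as lookup dictionaries for decoding annotations. This is a sketch only: it assumes the CSV files store the same integer codes as the report (types 0 to 3, targets 1 to 12), which this record does not confirm. Note that the item description numbers the types 1 to 4 instead. The names `HATE_SPEECH_TYPE`, `HATE_SPEECH_TARGET`, and `decode_annotation` are hypothetical.

```python
# Codes and English labels copied from the report preview.
HATE_SPEECH_TYPE = {
    0: "appropriate",
    1: "inappropriate",
    2: "offensive",
    3: "violent",
}

HATE_SPEECH_TARGET = {
    1: "racism",
    2: "migrants",
    3: "islamophobia",
    4: "antisemitism",
    5: "religion",
    6: "homophobia",
    7: "sexism",
    8: "ideology",
    9: "media",
    10: "politics",
    11: "individual",
    12: "other",
}


def decode_annotation(type_code, target_code=None):
    """Return (type_label, target_label) for one annotation.

    A missing or empty target code yields None, since appropriate and
    inappropriate tweets have no target according to the description.
    """
    type_label = HATE_SPEECH_TYPE[int(type_code)]
    if target_code is None or target_code == "":
        return type_label, None
    return type_label, HATE_SPEECH_TARGET[int(target_code)]


example = decode_annotation(2, 10)
```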

Name: IMSyPP_SI_anotacije_evaluation-clarin.csv
Size: 883.06 KB
Format: CSV file
Description: Evaluation dataset: Slovenian Twitter random sample labeled for hate speech type and target.
MD5: d1a0daa22905e4f1b582cd324b4c0074

Name: IMSyPP_SI_anotacije_training-report.txt
Size: 27.49 KB
Format: Text file
Description: Training dataset: annotation agreement scores for the training dataset.
MD5: 756517ba847f27ea5decae1f1bc7f46c

File Preview

Agreement report for both annotation questions: hate speech type (vrsta) and target (tarča) for the data in the file IMSyPP_SI_anotacije_training-clarin.csv.

Hate speech types (vrsta):
0 appropriate (ni sporni govor)
1 inappropriate (nespodobni govor)
2 offensive (žalitev)
3 violent (nasilje)

Hate speech targets (tarča):
1 racism (ksenofobija in rasizem)
2 migrants (begunci/migranti)
3 islamophobia (islamofobija)
4 antisemitism (antisemitizem)
5 religion (druge religije)
6 homophobia (homofobija)
7 sexism (seksizem)
8 ideology (ideologija)
9 media (novinarji in mediji)
10 politics (politika/-i)
11 individual (posameznik)
12 other (drugo)

Annotated instances

a0   9997 / 10000
a1   9950 / 10000
a2   9929 / 10000
a3   9992 / 10000
a4   10000 / 10000
a5   10000 / 10000
a6   9998 / 10000
a7   9979 / 10000
a8   9973 / 10000
a9   9991 / 10000


-----------------
-----OVERALL-----
-----------------
Annotated for  vrsta : 99809
vrsta
0 ni sporni govo . . .
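
Each file in this item is listed with an MD5 checksum, so a download can be checked for corruption. The sketch below uses only Python's standard `hashlib`; the file names and checksums are copied from the listing above, while the helper names `md5_of_file` and `verify_downloads` and the assumption that the files sit in a single local directory are illustrative.

```python
import hashlib
from pathlib import Path

# File names and MD5 values as listed in this item.
EXPECTED_MD5 = {
    "IMSyPP_SI_anotacije_training-clarin.csv": "0fe160a1e9ab82a723ba9b2ffee90f78",
    "IMSyPP_SI_anotacije_evaluation-report.txt": "a84c9b7a6c5cf99d0402160ec0d70273",
    "IMSyPP_SI_anotacije_evaluation-clarin.csv": "d1a0daa22905e4f1b582cd324b4c0074",
    "IMSyPP_SI_anotacije_training-report.txt": "756517ba847f27ea5decae1f1bc7f46c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(("OK       " if ok else "MISMATCH ") + name)
```

MD5 is suitable here only as an integrity check against transfer errors, which is how the repository listing uses it.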
