Slovenian Twitter dataset 2018-2020 1.0

Authors: Evkoski, Bojan ; Pelicon, Andraž ; Mozetič, Igor ; Ljubešić, Nikola ; Kralj Novak, Petra

Item identifier: http://hdl.handle.net/11356/1423

Project URL: http://imsypp.ijs.si

Referenced by: https://arxiv.org/pdf/2105.14898.pdf
https://arxiv.org/pdf/2105.06214.pdf

Date issued: 2021-07-20

Type: corpus, text

Size: 12961136 texts

Language(s): Slovenian

Description: The dataset covers tweets posted in Slovenian from 2018 to 2020. It consists of tweet IDs, retweet IDs, pseudo-anonymized user IDs, publication dates, and hate speech labels (acceptable, inappropriate, offensive, violent) assigned automatically by the model at https://huggingface.co/IMSyPP/hate_speech_slo. The dataset is the basis for the following two papers:
- "Retweet communities reveal the main source of hate speech" - https://arxiv.org/pdf/2105.14898.pdf
- "Community evolution in retweet networks" - https://arxiv.org/pdf/2105.06214.pdf

Publisher: Jožef Stefan Institute

Subject(s): Twitter, hate speech, retweet networks

Collection(s): CLARIN.SI data & tools

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Name: clarin_plos.zip
Size: 182.04 MB
Format: application/zip
Description: Dataset in CSV format
MD5: 58a693968c40b81b4cf483265e918a6a
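
The MD5 value listed above can be used to check that a downloaded copy of clarin_plos.zip is intact. The following is a minimal sketch, assuming the archive has been saved locally; the function names md5_of_file and matches_published_md5 are hypothetical helpers introduced for illustration and are not part of the dataset or the repository.

```python
import hashlib

# MD5 value copied from the file record above.
EXPECTED_MD5 = "58a693968c40b81b4cf483265e918a6a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat for a large archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected=EXPECTED_MD5):
    """Compare a local file's MD5 with the published value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_published_md5("clarin_plos.zip") should return True for an uncorrupted download, assuming the file was saved under that name in the current directory. MD5 is adequate here for detecting transfer errors, not for guarding against deliberate tampering.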

File Preview

- README.txt (843 B)
- clarin_plos_15072021.csv (609 MB)
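
This record does not list the CSV column names, so one cautious first step is to read only the header row straight from the archive, without extracting the 609 MB file. The sketch below assumes a comma-delimited, UTF-8 encoded CSV with a header row; neither detail is confirmed by this record (README.txt may say otherwise), and csv_header_from_zip is a hypothetical helper name.

```python
import csv
import io
import zipfile


def csv_header_from_zip(zip_path, member_suffix=".csv", encoding="utf-8"):
    """Return (member_name, header_row) for the first CSV found in a zip.

    Assumptions (not confirmed by the record): comma delimiter,
    UTF-8 encoding, and a header on the first line.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if n.endswith(member_suffix)]
        if not names:
            raise FileNotFoundError(f"no {member_suffix} member in {zip_path}")
        # Stream the member; only the first row is decompressed and parsed.
        with archive.open(names[0]) as raw:
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            header = next(csv.reader(text), [])
    return names[0], header
```

Calling csv_header_from_zip("clarin_plos.zip") would return the member name and its first row, which can then be checked against README.txt before loading the full file.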