Hate Speech in Chilean Twitter

Citation Author(s): Domingo Benoit (Universidad Técnica Federico Santa María)

Ricardo Ñanculef (Universidad Técnica Federico Santa María)
Submitted by: Domingo Benoit
Last updated: Mon, 07/08/2024 - 19:59
DOI: 10.21227/8b9y-wy71
Data Format: *.csv

Categories:

Keywords:

Artificial Intelligence; Dataset; Machine Learning; Hate Speech

Abstract

In the last few years, several organizations have expressed concern over the increasing use of hateful speech, or Hate Speech for short. The term refers to forms of expression or audio-visual content that encourage discrimination or violence against individuals or groups solely on the basis of their gender, sexual orientation, ethnicity, religion or nationality. Monitoring this phenomenon in a timely manner can help societies and their governments prevent tensions, crimes and conflicts that endanger not only the most fundamental democratic values but also order, stability and social peace.

The rapid mass adoption of social platforms has made them one of the main media people use today to create and share information. Consequently, social media platforms such as Twitter, Instagram and Facebook are where most Hate Speech is propagated today. Unfortunately, the great reach of these platforms, their public nature, the social dynamics perpetuated in them and the absence of an explicit regulatory framework only increase the magnitude of this phenomenon. Mining such conversations, for example Tweets, to build a dataset provides a resource for interdisciplinary research on the analysis of interests, views and opinions, and can help in creating tools that further our understanding of the social dynamics behind Hate Speech propagation. The dataset is compliant with Twitter's privacy policy, developer agreement, and guidelines for content redistribution, and with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management.

Data Description

The dataset comprises 4,547 Tweet IDs, with their Author IDs, for tweets about Hate Speech related to the Chilean dialect or news that were posted on Twitter from 2020 to July 2022, plus 6,542 Tweet IDs for the context of the classified tweets.

tweets_train.csv - Train set.
public_test_data.csv - Test set.
referenced_tweets_data.csv - Referenced Tweets data

The train set includes 2,255 examples labeled in 5 classes: "Odio" (Hate), "Mujeres" (Women), "Comunidad LGBTQ+" (LGBTQ+ Community), "Comunidades Migrantes" (Migrant Communities) and "Pueblos Originarios" (Indigenous Peoples). Each class takes the value 0 or 1: 1 (true) means the tweet contains the class and 0 (false) means it does not.
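As a rough illustration of how the labels could be inspected, the sketch below counts positive examples per class. It assumes the five class names are used verbatim as CSV column headers, which the description does not confirm, and it runs on made-up rows rather than on tweets_train.csv itself.

```python
import csv
import io

# Assumption: the label columns are named exactly like the five classes.
LABEL_COLUMNS = [
    "Odio",
    "Mujeres",
    "Comunidad LGBTQ+",
    "Comunidades Migrantes",
    "Pueblos Originarios",
]


def count_positive_labels(csv_file):
    """Count, for each label column, the rows whose value equals 1."""
    counts = {label: 0 for label in LABEL_COLUMNS}
    for row in csv.DictReader(csv_file):
        for label in LABEL_COLUMNS:
            # float() accepts both "1" and "1.0" as a positive label.
            if float(row[label]) == 1.0:
                counts[label] += 1
    return counts


# Made-up rows standing in for tweets_train.csv (hypothetical data).
sample = io.StringIO(
    "tweet_id,author_id,Odio,Mujeres,Comunidad LGBTQ+,"
    "Comunidades Migrantes,Pueblos Originarios\n"
    "1,10,1,1,0,0,0\n"
    "2,11,0,0,0,0,0\n"
    "3,12,1,0,0,1,0\n"
)
label_counts = count_positive_labels(sample)
print(label_counts)
```

To run it on the real file, pass an open file handle, for example `open("tweets_train.csv", newline="", encoding="utf-8")`, after checking the actual header names.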

The train set examples include the following columns:

tweet_id
author_id
conversation_id: Tuple containing the tweet IDs (from the file referenced_tweets_data.csv) that the labeled tweet references. The IDs are ordered so that the first position holds the tweet referenced by the labeled tweet, the second position holds the tweet referenced by the tweet in the first position, and so on.
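The ordering of conversation_id can be sketched as follows. This is a minimal example that assumes the tuple is stored in the CSV as a Python-style literal such as "(111, 222)"; the storage format is not stated in the description, and the helper names and sample values are hypothetical.

```python
import ast


def parse_conversation_ids(raw):
    """Parse a conversation_id cell into tweet ID strings, nearest reference first."""
    if raw is None or not raw.strip():
        return []
    # Assumption: the cell holds a Python-style tuple literal.
    value = ast.literal_eval(raw)
    if not isinstance(value, (tuple, list)):
        value = (value,)
    return [str(item) for item in value]


def build_context(raw, referenced):
    """Return referenced tweets in chain order, skipping IDs with no record."""
    return [referenced[tid] for tid in parse_conversation_ids(raw) if tid in referenced]


# Hypothetical lookup built from referenced_tweets_data.csv (ID -> record).
referenced = {
    "111": "tweet referenced by the labeled tweet",
    "222": "tweet referenced by tweet 111",
}

chain = parse_conversation_ids("(111, 222, 333)")
context = build_context("(111, 222, 333)", referenced)
print(chain)
print(context)
```

IDs are kept as strings here so that they can be matched against the referenced tweets file without depending on numeric parsing.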

The dataset contains only Tweet and Author IDs, in compliance with the terms and conditions of Twitter's privacy policy, developer agreement, and guidelines for content redistribution. The Tweet IDs need to be hydrated before they can be used. To hydrate this dataset, the Hydrator application may be used (a download link and a step-by-step tutorial on how to use Hydrator are provided).
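Before hydration, the Tweet IDs usually have to be extracted into a plain list. The sketch below collects the unique values of the tweet_id column and writes them one per line, assuming the hydration tool accepts a plain-text file in that layout. The function names and sample rows are hypothetical.

```python
import csv
import io


def collect_tweet_ids(csv_file, id_column="tweet_id"):
    """Return unique tweet IDs in first-seen order, kept as strings."""
    seen = set()
    ordered = []
    for row in csv.DictReader(csv_file):
        tweet_id = row[id_column].strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ordered.append(tweet_id)
    return ordered


def write_id_list(tweet_ids, out_file):
    """Write one tweet ID per line to an open text file."""
    for tweet_id in tweet_ids:
        out_file.write(tweet_id + "\n")


# Made-up rows standing in for one of the dataset CSV files.
sample = io.StringIO(
    "tweet_id,author_id\n"
    "1001,10\n"
    "1002,11\n"
    "1001,10\n"
)
ids = collect_tweet_ids(sample)

buffer = io.StringIO()
write_id_list(ids, buffer)
id_file_text = buffer.getvalue()
print(id_file_text)
```

In practice the output would go to a real file, for example `open("tweet_ids.txt", "w", encoding="utf-8")`, which can then be loaded into the hydration tool.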