Multilabel Extremism Classification Tweets Dataset

Citation Author(s): Mahamodul Hasan Mahadi, Md. Nasif Safwan
Submitted by: Mahamodul Hasan Mahadi
Last updated: Sat, 01/25/2025 - 15:57
DOI: 10.21227/rxj1-hm02
Data Format: *.csv

Categories:

Machine Learning

Keywords:

Extremism Classification Tweets Dataset

Abstract

The "Multilabel Extremism Classification Tweets Dataset" contains user comments annotated with six labels: toxic, severe toxic, obscene, threat, insult, and identity hate. A comment can carry several labels at once, so the dataset is designed for multi-label classification. It is intended for researchers working on the detection of online extremism and toxicity across multiple languages, and it supports the development of NLP models for content moderation, hate speech detection, and extremism identification. Because it provides diverse examples of harmful online behavior, it can be used to build robust models that recognize and categorize different forms of extremism in various contexts.

Instructions:

Embargo Note: The dataset is under embargo and will become publicly available on December 31, 2025. Until then, access is restricted to protect the integrity of the research, and the dataset may not be used, distributed, or cited.

The dataset is structured in a tabular format with the following columns:

id: Unique identifier for each comment.
comment: The text of the user-generated comment.
toxic: Binary label indicating if the comment is toxic (1) or not (0).
severe_toxic: Binary label indicating if the comment is severely toxic (1) or not (0).
obscene: Binary label indicating if the comment is obscene (1) or not (0).
threat: Binary label indicating if the comment contains a threat (1) or not (0).
insult: Binary label indicating if the comment contains an insult (1) or not (0).
identity_hate: Binary label indicating if the comment contains identity-based hate (1) or not (0).
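
As an illustration of how the column layout above might be consumed, here is a minimal Python sketch that reads rows into comment texts and six-element label vectors using only the standard library. The helper names (`load_multilabel_rows`, `label_counts`) and the in-memory sample rows are assumptions made for this example; they are placeholders, not records or tooling from the dataset itself.

```python
import csv
import io

# Label columns in the order listed in the column description above.
LABEL_COLUMNS = [
    "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate",
]


def load_multilabel_rows(csv_file):
    """Read an open CSV file and return (ids, comments, label_vectors).

    Hypothetical helper: each label vector is a list of six 0/1 integers
    in LABEL_COLUMNS order.
    """
    reader = csv.DictReader(csv_file)
    ids, comments, labels = [], [], []
    for row in reader:
        ids.append(row["id"])
        comments.append(row["comment"])
        labels.append([int(row[name]) for name in LABEL_COLUMNS])
    return ids, comments, labels


def label_counts(labels):
    """Count how many rows have each label set to 1."""
    return {
        name: sum(vector[i] for vector in labels)
        for i, name in enumerate(LABEL_COLUMNS)
    }


# Made-up placeholder rows that only mimic the documented column layout.
SAMPLE_CSV = (
    "id,comment,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    'a1,"placeholder comment one",0,0,0,0,0,0\n'
    'a2,"placeholder comment, with a comma",1,0,1,0,1,0\n'
    'a3,"placeholder comment three",1,1,0,1,0,0\n'
)

ids, comments, labels = load_multilabel_rows(io.StringIO(SAMPLE_CSV))
counts = label_counts(labels)
print(counts)
```

To read an actual file instead of the sample, one could pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` is whatever name the downloaded CSV has; the file name and encoding are not stated on this page. Per-label counts are a useful first check, since multi-label toxicity data is often heavily imbalanced.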