Multi-Label Extremism and Jihadism Classification Tweets Dataset

Citation Author(s): Mahamodul Hasan Mahadi, Md. Nasif Safwan
Submitted by: Mahamodul Hasan Mahadi
Last updated: Sat, 01/25/2025 - 15:58
DOI: 10.21227/6gmh-1b80
Data Format: *.csv

Categories:

Machine Learning

Keywords:

Extremism and Jihadism Classification Tweets Dataset

Abstract

The "Multi-Label Extremism and Jihadism Classification Tweets Dataset" is a multilingual resource for multi-label classification of toxic and extremist content online, including jihadist content. Each comment is annotated with seven binary labels: toxic, severe toxic, obscene, threat, insult, identity hate, and jihadi. The dataset supports research in automated content moderation by providing real-world examples of harmful and extremist content in multiple languages, with the aim of contributing to safer online environments.

Instructions:

The dataset is under embargo until December 31, 2025, when it will become publicly available. Until then, access is restricted to protect the integrity of the research, and the dataset may not be used, distributed, or cited.

Files

Terrorism and Multi Toxic labels Classification.csv: The primary dataset file containing the comments and their corresponding labels.

Columns

id: A unique identifier for each comment.
comment_text: The raw text of the comment.
toxic: Binary label (0 or 1) indicating the presence of general toxicity.
severe_toxic: Binary label (0 or 1) indicating the presence of severe toxicity.
obscene: Binary label (0 or 1) indicating the presence of obscenity.
threat: Binary label (0 or 1) indicating the presence of threats.
insult: Binary label (0 or 1) indicating the presence of insults.
identity_hate: Binary label (0 or 1) indicating the presence of identity hate.
jihadi: Binary label (0 or 1) indicating the presence of jihadist content.
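
The column layout above can be checked when the file is read. The following is a minimal sketch using only the Python standard library; it assumes the CSV has a header row with exactly the column names listed above, which the page does not confirm. The helper name read_rows and the sample rows are hypothetical placeholders, not data from the dataset.

```python
import csv
import io

# Column names as documented in the Columns section.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat",
                 "insult", "identity_hate", "jihadi"]
EXPECTED_COLUMNS = ["id", "comment_text"] + LABEL_COLUMNS


def read_rows(handle):
    """Parse an open CSV handle into dicts with integer 0/1 labels."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for record in reader:
        row = {"id": record["id"], "comment_text": record["comment_text"]}
        for name in LABEL_COLUMNS:
            value = int(record[name])
            if value not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1, got {value}")
            row[name] = value
        rows.append(row)
    return rows


# Made-up placeholder rows for illustration; not taken from the dataset.
SAMPLE_CSV = (
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,jihadi\n"
    "a1,placeholder text one,0,0,0,0,0,0,0\n"
    'a2,"placeholder, with a comma",1,0,1,0,1,0,0\n'
)

rows = read_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), rows[1]["comment_text"], rows[1]["toxic"])
```

To read the real file, the same function could be called on open("Terrorism and Multi Toxic labels Classification.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption and may need adjusting.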

Labels

Each comment is annotated with multiple binary labels that indicate the presence (1) or absence (0) of the following traits:

Toxic: General harmful language.
Severe Toxic: Extremely harmful or aggressive language.
Obscene: Language that is offensive or vulgar.
Threat: Language that expresses intent to harm.
Insult: Language intended to offend or demean.
Identity Hate: Language that targets a person or group based on their identity.
Jihadi: Content associated with jihadism or extremist ideologies.
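
Because the labels are independent binary flags, a single comment can carry several of them at once. The sketch below shows one way to summarize that: how often each label is set, and how many labels each comment carries. The helper names (make_row, label_counts, labels_per_comment) and the example rows are hypothetical; the label values are invented for illustration and say nothing about the real label distribution.

```python
from collections import Counter

# Label column names as documented in the Columns section.
LABEL_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat",
                 "insult", "identity_hate", "jihadi"]


def make_row(*positive_labels):
    """Build a label dict with 1 for the named labels and 0 elsewhere."""
    return {name: int(name in positive_labels) for name in LABEL_COLUMNS}


def label_counts(rows):
    """Number of comments in which each label is set to 1."""
    counts = Counter({name: 0 for name in LABEL_COLUMNS})
    for row in rows:
        for name in LABEL_COLUMNS:
            counts[name] += row[name]
    return counts


def labels_per_comment(rows):
    """Histogram mapping 'number of labels set' to 'number of comments'."""
    return Counter(sum(row[name] for name in LABEL_COLUMNS) for row in rows)


# Invented label combinations, not taken from the dataset.
example_rows = [
    make_row(),                           # no label set
    make_row("toxic", "insult"),          # two labels on one comment
    make_row("toxic", "threat", "jihadi"),
]

print(label_counts(example_rows)["toxic"])        # 2
print(dict(labels_per_comment(example_rows)))     # {0: 1, 2: 1, 3: 1}
```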

Applications

This dataset can be used for various tasks, including but not limited to:

Multi-label classification: Identifying multiple forms of extremism and toxicity in a single comment.
Extremism detection: Developing models that detect online extremism, including jihadist content.
Content moderation: Training models to assist in automated content moderation systems.
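
For the multi-label classification task, predictions are vectors of seven 0/1 values per comment, so they are usually scored with multi-label metrics rather than plain accuracy. The page does not prescribe any metric; the sketch below is an assumption-laden illustration of three common choices (Hamming loss, exact-match ratio, micro-averaged F1) in plain Python. The label vectors are invented and follow the column order toxic, severe_toxic, obscene, threat, insult, identity_hate, jihadi.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(true) for true in y_true)
    wrong = sum(a != b
                for true, pred in zip(y_true, y_pred)
                for a, b in zip(true, pred))
    return wrong / total


def exact_match_ratio(y_true, y_pred):
    """Fraction of comments whose whole label vector is predicted correctly."""
    hits = sum(list(true) == list(pred) for true, pred in zip(y_true, y_pred))
    return hits / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        for a, b in zip(true, pred):
            if a == 1 and b == 1:
                tp += 1
            elif a == 0 and b == 1:
                fp += 1
            elif a == 1 and b == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented reference labels and predictions, for illustration only.
y_true = [[1, 0, 0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0, 0, 1],
          [0, 0, 0, 0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0, 0],   # misses "insult"
          [0, 0, 0, 0, 0, 0, 1],   # fully correct
          [0, 0, 1, 0, 0, 0, 0]]   # spurious "obscene"

print(hamming_loss(y_true, y_pred))       # 2 wrong slots out of 21
print(exact_match_ratio(y_true, y_pred))  # 1 of 3 comments fully correct
print(micro_f1(y_true, y_pred))           # tp=2, fp=1, fn=1
```

Micro-averaging pools all labels together, so frequent labels dominate the score; if rare labels matter for a given study, per-label or macro-averaged scores may be a better fit.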

The "Terrorism and Multi-Toxic Labels Classification" dataset is a multilingual dataset curated to assist in the development and evaluation of models aimed at detecting online extremism and toxic behaviors. This dataset is particularly suited for tasks involving multi-label classification, where each comment may exhibit multiple forms of extremism and toxicity.

Mahamodul Hasa… Fri, 08/30/2024 - 12:46