Dataset for Cyberbullying detection in Mixed Urdu, Roman Urdu, and English Social Media Conversations

Submitted by: Fakhra Razi
Last updated: Fri, 02/16/2024 - 13:16


The dataset created for this study captures instances of cyberbullying in three languages: Urdu, Roman Urdu, and English. This selection mirrors the linguistic variation found in social media conversations among Urdu-speaking communities worldwide. The data is annotated to reflect the linguistic nuances of these languages and the key components of cyberbullying: aggression, repetition, and intent to harm. The aim is a dataset that captures the dynamics of cyberbullying in a multilingual context with the depth and breadth needed for effective analysis and detection.


The dataset contains the following columns:


Date: The date on which the interaction took place.

Time: The time at which the interaction occurred.

User1 ID: An identifier for the first user involved in the interaction.

User2 ID: An identifier for the second user involved in the interaction.

Message: The content of the message exchanged in the interaction, written in Urdu, Roman Urdu, or English.

Aggressive: A binary indicator (1 or 0) denoting whether the message is aggressive.

CB (Cyberbullying): A binary indicator (1 or 0) denoting whether the interaction is considered cyberbullying, as judged by the annotators. In addition to aggressive content, the other components of cyberbullying, such as repetition and intent to harm, were considered when assigning this label.
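
The column layout above can be checked with a short script before any analysis. The following is a minimal sketch rather than an official loader. It assumes the data is distributed as a CSV file whose header uses the column names exactly as listed here, which the description does not confirm. The sample rows are placeholders for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description; the exact header
# spelling in the distributed file is an assumption.
EXPECTED_COLUMNS = [
    "Date", "Time", "User1 ID", "User2 ID", "Message", "Aggressive", "CB",
]


def load_rows(handle):
    """Read rows from a CSV file handle and validate the binary label columns."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        for label in ("Aggressive", "CB"):
            # A short row yields None for the missing field; treat it as empty.
            value = (row[label] or "").strip()
            if value not in ("0", "1"):
                raise ValueError(
                    f"line {reader.line_num}: {label} must be 0 or 1, got {value!r}"
                )
            row[label] = int(value)
        rows.append(row)
    return rows


def label_crosstab(rows):
    """Count rows for each (Aggressive, CB) combination."""
    return Counter((r["Aggressive"], r["CB"]) for r in rows)


# Placeholder rows for illustration only; dates, IDs, and messages are invented.
SAMPLE = """Date,Time,User1 ID,User2 ID,Message,Aggressive,CB
2024-01-01,10:00,u1,u2,placeholder message A,0,0
2024-01-01,10:05,u1,u2,placeholder message B,1,0
2024-01-02,09:30,u1,u2,placeholder message C,1,1
"""

rows = load_rows(io.StringIO(SAMPLE))
print(label_crosstab(rows))
```

To run this against a real file, pass an open file handle to `load_rows` instead of the in-memory sample. Opening the file with `encoding="utf-8"` and `newline=""` is a reasonable starting point for Urdu text, though the actual encoding of the distributed file is an assumption. The cross-tabulation separates messages that are aggressive but not labeled as cyberbullying from those that carry both labels, which reflects the point that the CB label depends on repetition and intent to harm as well as aggression.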