Bangla SMS Dataset for Smishing Detection

Citation Author(s): Gazi Tanbhir, MD. Farhan Shahriyar
Submitted by: Gazi Tanbhir
Last updated: Mon, 08/26/2024 - 03:52
DOI: 10.21227/vxz9-ak04

Abstract

This dataset comprises 2,287 Bangla SMS text messages, categorized into three classes: Normal, Smish, and Promotional. The messages were collected via an online survey and validated by a cybersecurity expert. The dataset supports research on detecting smishing, a form of phishing carried out via SMS. Each message is labeled so that machine learning models for identifying cyber threats in Bangla SMS communications can be developed and evaluated. It is intended as a resource for protecting Bangla-speaking users from smishing attacks.

Instructions: 

This dataset was compiled to aid the detection of smishing attacks in Bangla-language SMS. Data were collected through an online survey targeting Bangla-speaking users. A cybersecurity expert then reviewed and categorized each SMS to ensure the integrity and relevance of the data.

The dataset comprises two columns:

  • label: Indicates the category of the SMS (Normal, Smish, Promotional).
  • text: Contains the actual Bangla text message.
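As a minimal sketch of how the two columns might be read, the snippet below assumes the data is distributed as a CSV file with a header row `label,text` and that the labels are spelled exactly as listed above; neither detail is confirmed by this page. The function name `load_messages` and the sample rows are hypothetical placeholders, not records from the dataset.

```python
import csv
import io

# Assumed label spellings; adjust if the actual file uses different casing.
VALID_LABELS = {"Normal", "Smish", "Promotional"}


def load_messages(handle):
    """Read (label, text) pairs from an open CSV handle with a label,text header."""
    reader = csv.DictReader(handle)
    rows = []
    for row in reader:
        label = row["label"].strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((label, row["text"]))
    return rows


# Placeholder rows for illustration only (the Bangla text means "sample message").
SAMPLE_CSV = (
    "label,text\n"
    "Normal,নমুনা বার্তা ১\n"
    "Smish,নমুনা বার্তা ২\n"
    "Promotional,নমুনা বার্তা ৩\n"
)

sample_rows = load_messages(io.StringIO(SAMPLE_CSV))
```

With a real file, the same function could be called on a handle opened with `encoding="utf-8"` and `newline=""`, which is the usual way to read Bangla text with the `csv` module.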

The distribution of the data is as follows:

  • Normal: 924 messages that represent regular, non-threatening SMS.
  • Smish: 914 messages identified as attempts to phish the recipient.
  • Promotional: 449 messages containing promotional content, which is often confused with smishing.
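The counts above sum to 2,287, and the Promotional class is roughly half the size of the other two. The sketch below checks that arithmetic and derives per-class shares and inverse-frequency weights. The weighting formula, total / (number of classes × class count), is a common heuristic for imbalanced classes and is an illustration here, not something the dataset authors prescribe.

```python
# Class counts as stated in the dataset description.
counts = {"Normal": 924, "Smish": 914, "Promotional": 449}

total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: rarer classes receive larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: {counts[label]} messages, {shares[label]:.1%}, weight {weights[label]:.3f}")
```

Normal and Smish each account for about 40% of the messages and Promotional for just under 20%, so a model evaluated on plain accuracy may underweight the Promotional class.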

The dataset contains 1,733 unique entries among its 2,287 messages, which implies that some messages appear more than once. It provides a foundation for developing and testing smishing detection models, particularly machine learning models that improve cybersecurity for Bangla-speaking populations.
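Because the total message count (2,287) exceeds the number of unique entries (1,733), some messages presumably repeat, and repeated messages can leak between training and test sets. The following is a minimal sketch of deduplicating before a split. How the authors counted unique entries is not stated, so the normalization used here (Unicode NFC plus whitespace collapsing) is an assumption, and the rows are hypothetical placeholders.

```python
import unicodedata


def normalize(text):
    """NFC-normalize and collapse whitespace so visually identical messages compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def deduplicate(rows):
    """Keep the first (label, text) pair seen for each normalized text."""
    seen = set()
    unique = []
    for label, text in rows:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append((label, text))
    return unique


# Placeholder rows for illustration only; the second differs from the first only in spacing.
rows = [
    ("Normal", "নমুনা বার্তা ১"),
    ("Normal", "নমুনা  বার্তা ১ "),
    ("Smish", "নমুনা বার্তা ২"),
]

unique_rows = deduplicate(rows)
```

If the same text ever appears under two different labels, this sketch silently keeps the first label; such conflicts may be worth inspecting by hand instead.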