Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations

Citation Author(s):
Milandu Keith Moussavou Boussougou, Soongsil University
Prince Hamandawana, Ajou University
Dong-Joo Park, Soongsil University
Submitted by:
MILANDU KEITH M...
Last updated:
Mon, 11/11/2024 - 05:03
DOI:
10.21227/163c-0542

Abstract 

This dataset contains original and augmented versions of the Korean Call Content Vishing (KorCCVi v2) dataset used in the study titled "Enhancing Voice Phishing Detection Using Multilingual Back-Translation and SMOTE: An Empirical Study." The dataset addresses the challenges of data imbalance and asymmetry in Korean voice phishing detection by applying two data augmentation techniques: multilingual back-translation (BT) with English, Chinese, and Japanese as intermediate languages, and the Synthetic Minority Oversampling Technique (SMOTE). The augmented dataset is a resource for machine learning (ML) and deep learning (DL) applications in natural language processing (NLP) and cybersecurity research.

 

Instructions: 

Dataset Description

The dataset consists of original and augmented samples derived from the KorCCVi v2 dataset, which contains transcripts of Korean phone conversations classified into two categories:

  1. Voice Phishing Conversations (Vishing): Genuine transcripts of voice phishing scams.
  2. Non-Voice Phishing Conversations (Non-Vishing): Normal phone conversations.

Augmentation Techniques

  1. Multilingual Back-Translation (BT):

    • Intermediate languages: English, Chinese, Japanese.
    • Each transcript is translated into an intermediate language and back into Korean. The process is intended to preserve the meaning and context of the original while generating lexically diverse synthetic samples.
  2. SMOTE (Synthetic Minority Oversampling Technique):

    • Balances the dataset by oversampling the minority class.
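
To illustrate the oversampling step, here is a minimal SMOTE-style sketch in plain Python. It is not the implementation used in the study. SMOTE operates on numeric feature vectors, so the transcripts would first have to be vectorized; the toy 2-D vectors, the function name `smote`, and the parameters `n_new`, `k`, and `seed` are assumptions made for this example only.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest neighbours of the base point among the other minority samples.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for vectorized transcripts (hypothetical data).
vishing = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
non_vishing = [[5.0, 5.0], [5.1, 4.8], [4.9, 5.2], [5.3, 5.1], [4.7, 4.9], [5.0, 5.4]]

new_points = smote(vishing, n_new=len(non_vishing) - len(vishing), k=2)
balanced_vishing = vishing + new_points
print(len(balanced_vishing), len(non_vishing))
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE adds variety without leaving the region the minority class already occupies.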

Data Structure

  1. Original Dataset: Contains unbalanced data with two classes (vishing and non-vishing).
  2. BT-Augmented Datasets: Training datasets augmented using back-translation:

    • BT-Eng: Korean ⇌ English.
    • BT-Chi: Korean ⇌ Chinese.
    • BT-Jap: Korean ⇌ Japanese.
    • BT-All: Combination of all BT augmentations.
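
The round trips listed above can be sketched as follows. This page does not name the translation system used, so the sketch takes the translator as a caller-supplied function; the `Translator` interface, `toy_translate`, and the language codes passed to it are assumptions for illustration. The sketch returns only the synthetic samples; whether and how they are merged with the original training data is not specified here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: (text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]
Sample = Tuple[str, int]  # (transcript, label)

# Dataset names from this page mapped to ISO 639-1 pivot language codes.
PIVOTS = {"BT-Eng": "en", "BT-Chi": "zh", "BT-Jap": "ja"}


def back_translate(text: str, pivot: str, translate: Translator) -> str:
    """Korean -> pivot language -> Korean."""
    intermediate = translate(text, "ko", pivot)
    return translate(intermediate, pivot, "ko")


def build_bt_sets(samples: List[Sample], translate: Translator) -> Dict[str, List[Sample]]:
    """Return one synthetic list per pivot plus their concatenation (BT-All)."""
    sets: Dict[str, List[Sample]] = {}
    for name, pivot in PIVOTS.items():
        # Each synthetic transcript keeps the label of its source transcript.
        sets[name] = [
            (back_translate(text, pivot, translate), label)
            for text, label in samples
        ]
    sets["BT-All"] = [s for name in PIVOTS for s in sets[name]]
    return sets


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in for a real translation system: it only records the route."""
    return f"{text}|{src}>{tgt}"


bt_sets = build_bt_sets([("여보세요", 1)], toy_translate)
print(bt_sets["BT-Eng"])
```

Replacing `toy_translate` with a call to a real machine translation service is the only change needed to produce actual paraphrases.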

Dataset Format

    • File Types: CSV files.
    • Attributes:
      • id: Unique identifier for each sample.
      • text: Transcript of the conversation.
      • label: Class label (0 = Non-Vishing, 1 = Vishing).
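
A minimal sketch of reading a file with these three attributes, using only the Python standard library. The file names in the archive are not listed on this page, so the example parses an in-memory CSV with placeholder rows; `load_samples` and `LABEL_NAMES` are names chosen for this example.

```python
import csv
import io
from collections import Counter

# Label encoding as documented above.
LABEL_NAMES = {0: "non-vishing", 1: "vishing"}


def load_samples(handle):
    """Read rows with the columns id, text, and label from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [
        {"id": row["id"], "text": row["text"], "label": int(row["label"])}
        for row in reader
    ]


# Placeholder rows, not taken from the dataset.
sample_csv = io.StringIO(
    "id,text,label\n"
    "1,first placeholder transcript,0\n"
    "2,second placeholder transcript,1\n"
    '3,"third placeholder, with a comma",0\n'
)

samples = load_samples(sample_csv)
counts = Counter(LABEL_NAMES[s["label"]] for s in samples)
print(counts)
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` to `load_samples`; UTF-8 is an assumption here and should be checked against the downloaded files.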

Total Samples

    • Original Dataset: 2,927 samples (695 vishing, 2,232 non-vishing).
    • BT-Augmented Datasets: ~3,502 samples for the largest combination.
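
The degree of imbalance follows directly from the original counts. The short calculation below uses only the figures stated above. The last value is the number of extra vishing samples that would equalize the classes if oversampling were applied to the full original set, which is an assumption: augmentation is described as applying to the training data, so the actual number depends on the split.

```python
vishing, non_vishing = 695, 2232

total = vishing + non_vishing            # 2,927 samples overall
minority_share = vishing / total         # fraction of vishing samples
imbalance_ratio = non_vishing / vishing  # non-vishing samples per vishing sample
to_balance = non_vishing - vishing       # extra vishing samples needed for a 1:1 ratio

print(f"total={total}, vishing share={minority_share:.1%}, "
      f"ratio={imbalance_ratio:.2f}:1, to balance={to_balance}")
```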

Applications

The dataset supports research in:

    • Voice phishing detection using ML and DL.
    • Natural language processing in low-resource languages.
    • Cybersecurity applications focused on social engineering.

Licensing

    • The dataset is released under the CC BY-NC-SA 4.0 license, which allows non-commercial use with attribution, provided that derivative works are shared under the same license.

Funding Agency:
Ministry of Science and ICT, Korea
Grant Number:
2024-0-00071