Standard Dataset

Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations

Citation Author(s):
Milandu Keith Moussavou Boussougou (Soongsil University)
Prince Hamandawana (Ajou University)
Dong-Joo Park (Soongsil University)
Submitted by:
MILANDU KEITH MOUSSAVOU BOUSSOUGOU
DOI:
10.21227/163c-0542

Abstract

This dataset contains the original and augmented versions of the Korean Call Content Vishing (KorCCVi v2) dataset used in the study "Enhancing Voice Phishing Detection Using Multilingual Back-Translation and SMOTE: An Empirical Study." It addresses data imbalance and asymmetry in Korean voice phishing detection through two data augmentation techniques: multilingual back-translation (BT), with English, Chinese, and Japanese as intermediate languages, and the Synthetic Minority Oversampling Technique (SMOTE). The augmented data is intended for machine learning (ML) and deep learning (DL) applications in natural language processing (NLP) and cybersecurity research.

Instructions:

Dataset Description

The dataset consists of original and augmented samples derived from the KorCCVi v2 dataset, which contains transcripts of Korean phone conversations classified into two categories:

  1. Voice Phishing Conversations (Vishing): Genuine transcripts of voice phishing scams.
  2. Non-Voice Phishing Conversations (Non-Vishing): Normal phone conversations.

Augmentation Techniques

  1. Multilingual Back-Translation (BT):

    • Intermediate languages: English, Chinese, Japanese.
    • The back-translation process is designed to preserve linguistic and contextual fidelity while generating diverse synthetic samples.
  2. SMOTE (Synthetic Minority Oversampling Technique):

    • Balances the dataset by oversampling the minority class.
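
The interpolation step at the core of SMOTE can be sketched as follows. This is a minimal pure-Python illustration, not the study's implementation: the source does not say which feature representation or SMOTE library was used, and the function names (`smote`, `nearest_neighbors`, `smote_sample`) and the toy vectors are hypothetical. SMOTE works on numeric feature vectors rather than raw text, so transcripts would first need to be vectorized.

```python
import random


def smote_sample(x, neighbor, rng):
    """Return one synthetic point on the segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


def nearest_neighbors(point, others, k):
    """Indices of the k points in `others` closest to `point` (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, other)), i)
        for i, other in enumerate(others)
    )
    return [i for _, i in dists[:k]]


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority-class vectors by neighbor interpolation."""
    if len(minority) <= k:
        raise ValueError("need more than k minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        rest = minority[:i] + minority[i + 1:]  # exclude the base point itself
        neighbor = rest[rng.choice(nearest_neighbors(base, rest, k))]
        synthetic.append(smote_sample(base, neighbor, rng))
    return synthetic


# Toy 2-D feature vectors standing in for vectorized vishing transcripts (assumed data).
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.9, 0.1]]
new_points = smote(minority, n_new=3)
```

Each synthetic vector lies between an existing minority sample and one of its nearest minority neighbors, which is why SMOTE adds variety without simply duplicating rows. For real experiments, a maintained implementation such as the one in the imbalanced-learn package is likely a better choice than this sketch.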

Data Structure

  1. Original Dataset: Contains the imbalanced data with two classes (vishing and non-vishing).
  2. BT-Augmented Datasets: Training datasets augmented with back-translation:
    • BT-Eng: Korean ⇌ English.
    • BT-Chi: Korean ⇌ Chinese.
    • BT-Jap: Korean ⇌ Japanese.
    • BT-All: Combination of all BT augmentations.

Dataset Format

  • File Types: CSV files.
  • Attributes:
    • id: Unique identifier for each sample.
    • text: Transcript of the conversation.
    • label: Class label (0 = Non-Vishing, 1 = Vishing).
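
A file with these three attributes could be read and validated along the following lines. This is a sketch under assumptions: the header names are taken to match the attribute names exactly, the rows in `SAMPLE_CSV` are invented placeholders rather than dataset content, and `load_samples` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows; the real files contain Korean call transcripts.
SAMPLE_CSV = """id,text,label
1,"안녕하세요, 검찰청입니다.",1
2,"오늘 저녁에 뭐 먹을까?",0
3,"내일 회의는 몇 시예요?",0
"""

EXPECTED_COLUMNS = {"id", "text", "label"}
LABEL_NAMES = {0: "non-vishing", 1: "vishing"}


def load_samples(handle):
    """Parse rows from an open CSV handle and check the columns and label values."""
    reader = csv.DictReader(handle)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    samples = []
    for row in reader:
        label = int(row["label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r} for id {row['id']}")
        samples.append({"id": row["id"], "text": row["text"], "label": label})
    return samples


samples = load_samples(io.StringIO(SAMPLE_CSV))
counts = Counter(LABEL_NAMES[s["label"]] for s in samples)
```

To read an actual file, pass a handle opened with `newline=""` and an explicit encoding. UTF-8 is a reasonable first guess for Korean text, but the encoding is not stated in the source.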

Total Samples

  • Original Dataset: 2,927 samples (695 vishing, 2,232 non-vishing).
  • BT-Augmented Datasets: approximately 3,502 samples for the largest combination.
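
The degree of imbalance follows directly from the original class counts. The short calculation below is illustrative arithmetic on the figures above. The variable names are arbitrary, and the last quantity assumes balancing over the full dataset, whereas oversampling is normally applied to a training split only, so the number needed in practice would differ.

```python
vishing = 695
non_vishing = 2232
total = vishing + non_vishing  # 2927, matching the stated dataset size

minority_share = vishing / total          # fraction of samples that are vishing
imbalance_ratio = non_vishing / vishing   # non-vishing samples per vishing sample
needed_for_balance = non_vishing - vishing  # extra vishing samples for a 1:1 ratio

print(f"vishing share: {minority_share:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
print(f"synthetic samples for 1:1 balance: {needed_for_balance}")
```

Roughly one transcript in four is a vishing call, which is the gap the augmentation techniques are meant to narrow.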

Applications

The dataset supports research in:

  • Voice phishing detection using ML and DL.
  • Natural language processing in low-resource languages.
  • Cybersecurity applications focused on social engineering.

Licensing

  • The dataset is released under the CC BY-NC-SA 4.0 license, which permits non-commercial use with attribution, provided that derivative works are shared under the same license.
 
Funding Agency
Ministry of Science and ICT, Korea
Grant Number
2024-0-00071