Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations

Citation Author(s): Milandu Keith Moussavou Boussougou (Soongsil University); Prince Hamandawana (Ajou University); Dong-Joo Park (Soongsil University)
Submitted by: Milandu Keith Moussavou Boussougou
Last updated: Mon, 11/11/2024 - 10:03
DOI: 10.21227/163c-0542
Data Format: CSV

Keywords:

Voice Phishing; Data Augmentation; Back-Translation; SMOTE; Imbalanced Dataset; Natural Language Processing; Cybersecurity; Korean Language

Abstract

This dataset contains the original and augmented versions of the Korean Call Content Vishing (KorCCVi v2) dataset used in the study "Enhancing Voice Phishing Detection Using Multilingual Back-Translation and SMOTE: An Empirical Study." It addresses data imbalance and asymmetry in Korean voice phishing detection through two data augmentation techniques: multilingual back-translation (BT), with English, Chinese, and Japanese as intermediate languages, and the Synthetic Minority Oversampling Technique (SMOTE). The augmented data are intended for machine learning (ML) and deep learning (DL) work in natural language processing (NLP) and cybersecurity research.

Instructions:

Dataset Description

The dataset consists of original and augmented samples derived from the KorCCVi v2 dataset, which contains transcripts of Korean phone conversations classified into two categories:

Voice Phishing Conversations (Vishing): Genuine transcripts of voice phishing scams.
Non-Voice Phishing Conversations (Non-Vishing): Normal phone conversations.

Augmentation Techniques

Multilingual Back-Translation (BT):
- Intermediate languages: English, Chinese, Japanese.
- The back-translation process preserves linguistic and contextual fidelity while generating diverse synthetic samples.
SMOTE (Synthetic Minority Oversampling Technique):
- Balances the dataset by oversampling the minority class.
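
To make the oversampling step concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, not the authors' implementation. SMOTE works on numeric feature vectors, so it assumes the transcripts have already been converted to vectors; this page does not say which representation was used. The function name `smote` and its parameters are illustrative only. In practice a maintained library such as imbalanced-learn would normally be used instead.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy 2-D minority class (made-up values, for illustration only)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplication.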

Data Structure

Original Dataset: The imbalanced data with two classes (vishing and non-vishing).
BT-Augmented Datasets: Augmented training datasets using back-translation:

BT-Eng: Korean ⇌ English.
BT-Chi: Korean ⇌ Chinese.
BT-Jap: Korean ⇌ Japanese.
BT-All: Combination of all BT augmentations.
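
The four BT variants above can be sketched as a round trip through a pivot language. This is only an illustration under stated assumptions: the `translate` callable is hypothetical (the page does not name the translation system), the language codes are placeholders, and BT-All is assumed to be a plain concatenation of the three back-translated sets. Whether the released files also include the original transcripts is not stated here.

```python
from typing import Callable, Dict, List

# Hypothetical translator signature: translate(text, src, tgt) -> translated text
Translator = Callable[[str, str, str], str]

# Pivot language per variant (codes are assumptions for illustration)
PIVOTS = {"BT-Eng": "en", "BT-Chi": "zh", "BT-Jap": "ja"}

def back_translate(text: str, pivot: str, translate: Translator) -> str:
    """Korean -> pivot language -> Korean round trip."""
    return translate(translate(text, "ko", pivot), pivot, "ko")

def build_bt_sets(texts: List[str], translate: Translator) -> Dict[str, List[str]]:
    sets = {
        name: [back_translate(t, code, translate) for t in texts]
        for name, code in PIVOTS.items()
    }
    # Assumption: BT-All is the concatenation of the three variants
    sets["BT-All"] = [t for name in PIVOTS for t in sets[name]]
    return sets

def fake_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in that only records the translation path; not a real translator."""
    return f"{text}|{src}>{tgt}"

bt_sets = build_bt_sets(["안녕하세요"], fake_translate)
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation service; a real one would replace `fake_translate`.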

Dataset Format
- File Types: CSV files.
- Attributes:
  - id: Unique identifier for each sample.
  - text: Transcript of the conversation.
  - label: Class label (0 = Non-Vishing, 1 = Vishing).
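
As a quick check of the format described above, the sketch below reads rows with the `id`, `text`, and `label` columns and counts samples per class. The rows are made-up placeholders, not taken from the dataset, and the helper name `count_labels` is illustrative.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for a real file; columns follow the attribute list
sample_csv = io.StringIO(
    "id,text,label\n"
    "1,example non-vishing transcript,0\n"
    "2,example vishing transcript,1\n"
    "3,another non-vishing transcript,0\n"
)

LABEL_NAMES = {0: "Non-Vishing", 1: "Vishing"}

def count_labels(handle):
    """Count samples per class from a CSV handle with a 'label' column."""
    reader = csv.DictReader(handle)
    return Counter(LABEL_NAMES[int(row["label"])] for row in reader)

counts = count_labels(sample_csv)
```

For a downloaded file, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to one of the CSV files. The UTF-8 encoding is an assumption and may need adjusting.
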
Total Samples