Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations

- Submitted by: MILANDU KEITH MOUSSAVOU BOUSSOUGOU
- DOI: 10.21227/163c-0542
Abstract
This dataset contains original and augmented versions of the Korean Call Content Vishing (KorCCVi v2) dataset used in the study "Enhancing Voice Phishing Detection Using Multilingual Back-Translation and SMOTE: An Empirical Study." It addresses the challenges of data imbalance and asymmetry in Korean voice phishing detection through two data augmentation techniques: multilingual back-translation (BT) with English, Chinese, and Japanese as intermediate languages, and the Synthetic Minority Oversampling Technique (SMOTE). The augmented dataset is intended as a resource for machine learning (ML) and deep learning (DL) applications in natural language processing (NLP) and cybersecurity research.
Instructions:
Dataset Description
The dataset consists of original and augmented samples derived from the KorCCVi v2 dataset, which contains transcripts of Korean phone conversations classified into two categories:
- Voice Phishing Conversations (Vishing): Genuine transcripts of voice phishing scams.
- Non-Voice Phishing Conversations (Non-Vishing): Normal phone conversations.
Augmentation Techniques
Multilingual Back-Translation (BT):
- Intermediate languages: English, Chinese, Japanese.
- Each transcript is translated from Korean into an intermediate language and back into Korean. The process is intended to preserve linguistic and contextual fidelity while generating diverse synthetic samples.
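The round trip described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study. `translate` is a hypothetical callable standing in for whatever machine-translation system is available, the language codes are assumptions, and dropping outputs identical to the input is a design choice made here, not something stated in the dataset description.

```python
from typing import Callable, Dict, List

# Hypothetical signature: translate(text, source_lang, target_lang) -> str.
Translator = Callable[[str, str, str], str]

# Assumed language codes for the three intermediate (pivot) languages.
PIVOTS = {"BT-Eng": "en", "BT-Chi": "zh", "BT-Jap": "ja"}


def back_translate(text: str, pivot: str, translate: Translator) -> str:
    """Translate Korean text into a pivot language and back into Korean."""
    intermediate = translate(text, "ko", pivot)
    return translate(intermediate, pivot, "ko")


def augment(texts: List[str], translate: Translator) -> Dict[str, List[str]]:
    """Return one list of back-translated texts per pivot language.

    Outputs identical to the source text are dropped (an assumption of
    this sketch), since they add no diversity to the training set.
    """
    augmented: Dict[str, List[str]] = {}
    for name, pivot in PIVOTS.items():
        results = []
        for text in texts:
            candidate = back_translate(text, pivot, translate)
            if candidate != text:
                results.append(candidate)
        augmented[name] = results
    return augmented
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation service; each augmented sample would keep the label of the transcript it was derived from.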
SMOTE (Synthetic Minority Oversampling Technique):
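- Balances the dataset by oversampling the minority class (vishing): new synthetic samples are created by interpolating between existing minority samples and their nearest minority-class neighbors in feature space.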
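SMOTE operates on numeric feature vectors, so transcripts must first be converted to features; the dataset description does not say which representation was used. The core interpolation idea can be sketched in plain Python as below. This is an illustrative simplification, not the study's code; in practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used. The parameter names and defaults here are assumptions.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic points from minority-class vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.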
Data Structure
- Original Dataset: Contains imbalanced data with two classes (vishing and non-vishing).
- BT-Augmented Datasets: Augmented training datasets using back-translation:
  - BT-Eng: Korean ⇌ English.
  - BT-Chi: Korean ⇌ Chinese.
  - BT-Jap: Korean ⇌ Japanese.
  - BT-All: Combination of all BT augmentations.
Dataset Format
- File Types: CSV files.
- Attributes:
  - id: Unique identifier for each sample.
  - text: Transcript of the conversation.
  - label: Class label (0 = Non-Vishing, 1 = Vishing).
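A minimal sketch of reading a file in this format with Python's standard `csv` module is shown below. It assumes each CSV has a header row with the three documented column names; the file names are not given in this description, so the example parses a small in-memory stand-in with placeholder rows rather than real data.

```python
import csv
import io
from collections import Counter

LABEL_NAMES = {0: "non-vishing", 1: "vishing"}


def read_samples(handle):
    """Read rows with the documented columns: id, text, label."""
    reader = csv.DictReader(handle)
    samples = []
    for row in reader:
        label = int(row["label"])
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label {label!r} for id {row['id']}")
        samples.append({"id": row["id"], "text": row["text"], "label": label})
    return samples


def class_counts(samples):
    """Count samples per class name."""
    return Counter(LABEL_NAMES[s["label"]] for s in samples)


# In-memory stand-in for one CSV file (placeholder rows, not real data).
demo = io.StringIO(
    "id,text,label\n"
    "1,placeholder transcript A,1\n"
    "2,placeholder transcript B,0\n"
)
samples = read_samples(demo)
counts = class_counts(samples)
```

For a file on disk, the same function could be called with a handle opened as `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the actual files.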
Total Samples
- Original Dataset: 2,927 samples (695 vishing, 2,232 non-vishing).
- BT-Augmented Datasets: ~3,502 samples for the largest combination.
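The degree of imbalance follows directly from the original counts above: vishing makes up roughly 23.7% of the samples, a ratio of about 3.2 non-vishing transcripts per vishing transcript. The short calculation below makes this explicit. The last figure assumes balancing were applied to the whole original dataset; the description does not say which split was oversampled, so treat it as illustrative.

```python
VISHING = 695
NON_VISHING = 2232
total = VISHING + NON_VISHING

# Share of the minority (vishing) class in the original dataset.
minority_share = VISHING / total
# Majority-to-minority ratio.
imbalance_ratio = NON_VISHING / VISHING
# Synthetic vishing samples needed for a 1:1 ratio, assuming the
# whole original dataset were balanced (illustrative assumption).
needed_for_balance = NON_VISHING - VISHING

print(f"total={total}, minority share={minority_share:.1%}, "
      f"ratio={imbalance_ratio:.2f}:1, needed={needed_for_balance}")
```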
Applications
The dataset supports research in:
- Voice phishing detection using ML and DL.
- Natural language processing in low-resource languages.
- Cybersecurity applications focused on social engineering.
Licensing
- The dataset is released under the CC BY-NC-SA 4.0 license, which allows non-commercial use with attribution and requires derivative works to be shared under the same license.