Korean Voice Phishing Detection Dataset with Multilingual Back-Translation and SMOTE Augmentations
- Submitted by: MILANDU KEITH M...
- Last updated: Mon, 11/11/2024 - 05:03
- DOI: 10.21227/163c-0542
- Data Format: CSV
- License: CC BY-NC-SA 4.0
Abstract
This dataset contains original and augmented versions of the Korean Call Content Vishing (KorCCVi v2) dataset used in the study titled, "Enhancing Voice Phishing Detection Using Multilingual Back-Translation and SMOTE: An Empirical Study." The dataset addresses challenges of data imbalance and asymmetry in Korean voice phishing detection, leveraging data augmentation techniques such as multilingual back-translation (BT) with English, Chinese, and Japanese as intermediate languages, and Synthetic Minority Oversampling Technique (SMOTE). The augmented dataset provides a valuable resource for machine learning (ML) and deep learning (DL) applications in natural language processing (NLP) and cybersecurity research.
Dataset Description
The dataset consists of original and augmented samples derived from the KorCCVi v2 dataset, which contains transcripts of Korean phone conversations classified into two categories:
- Voice Phishing Conversations (Vishing): Genuine transcripts of voice phishing scams.
- Non-Voice Phishing Conversations (Non-Vishing): Normal phone conversations.
Augmentation Techniques
- Multilingual Back-Translation (BT):
  - Intermediate languages: English, Chinese, Japanese.
  - The back-translation process preserves linguistic and contextual fidelity while generating diverse synthetic samples.
- SMOTE (Synthetic Minority Oversampling Technique):
  - Balances the dataset by oversampling the minority class.
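The oversampling idea behind SMOTE can be illustrated with a minimal, standard-library sketch. It is not the implementation used in the study: the function name `smote`, the parameter defaults, and the toy 2-D vectors (standing in for vectorized transcripts) are all assumptions made for illustration. Each synthetic point is placed on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point,
        # excluding the base point itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy 2-D feature vectors standing in for vectorized vishing transcripts.
vishing_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(vishing_vectors, n_synthetic=6, k=2)
print(len(new_points))
```

Because SMOTE interpolates in a numeric feature space, text has to be converted to vectors first; the synthetic points are feature vectors, not new transcripts.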
Data Structure
- Original Dataset: Contains unbalanced data with two classes (vishing and non-vishing).
- BT-Augmented Datasets: Training datasets augmented using back-translation:
  - BT-Eng: Korean ⇌ English.
  - BT-Chi: Korean ⇌ Chinese.
  - BT-Jap: Korean ⇌ Japanese.
  - BT-All: Combination of all BT augmentations.
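The round trip behind the BT variants listed above can be sketched as follows. This is a hedged illustration, not the study's pipeline: the `translate` callable, the language codes (`ko`, `en`, `zh`, `ja`), and the helper names are hypothetical, and `fake_translate` only tags the text so the example runs without a translation service. How the published files combine back-translated samples with the original ones is not specified here, so the sketch returns only the back-translated variants.

```python
from typing import Callable, Dict, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text
Translate = Callable[[str, str, str], str]

# Assumed language codes for the three intermediate languages.
PIVOTS = {"BT-Eng": "en", "BT-Chi": "zh", "BT-Jap": "ja"}


def back_translate(text: str, pivot: str, translate: Translate) -> str:
    """Translate Korean text to the pivot language and back to Korean."""
    intermediate = translate(text, "ko", pivot)
    return translate(intermediate, pivot, "ko")


def build_bt_sets(texts: List[str], translate: Translate) -> Dict[str, List[str]]:
    """Build one list of back-translated texts per pivot, plus their union."""
    sets = {
        name: [back_translate(t, code, translate) for t in texts]
        for name, code in PIVOTS.items()
    }
    sets["BT-All"] = [t for name in PIVOTS for t in sets[name]]
    return sets


def fake_translate(text: str, source: str, target: str) -> str:
    # Stand-in for a real translation model: records the hop, translates nothing.
    return f"{text}|{source}>{target}"


bt_sets = build_bt_sets(["안녕하세요"], fake_translate)
print(bt_sets["BT-Eng"][0])
```

Passing the translator in as a parameter keeps the sketch independent of any particular translation model or service.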
Dataset Format
- File Types: CSV files.
- Attributes:
  - id: Unique identifier for each sample.
  - text: Transcript of the conversation.
  - label: Class label (0 = Non-Vishing, 1 = Vishing).
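A minimal sketch of reading a file with this layout, using only the Python standard library. The helper name `load_samples` and the two inline placeholder rows are assumptions for illustration; the actual file names and text encoding of the published CSVs are not stated here.

```python
import csv
import io
from collections import Counter

EXPECTED_COLUMNS = ["id", "text", "label"]


def load_samples(handle):
    """Read rows from an open CSV handle and validate the documented schema."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label {label!r} for id {row['id']}")
        rows.append({"id": row["id"], "text": row["text"], "label": label})
    return rows


# Placeholder data in the documented layout (not real transcripts).
sample_csv = io.StringIO(
    "id,text,label\n"
    "1,placeholder transcript A,0\n"
    "2,placeholder transcript B,1\n"
)
rows = load_samples(sample_csv)
label_counts = Counter(r["label"] for r in rows)
print(label_counts)
```

For a file on disk, the same function can be given a handle such as `open(path, newline="", encoding="utf-8")`, assuming the files are UTF-8 encoded.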
Total Samples
- Original Dataset: 2,927 samples (695 vishing, 2,232 non-vishing).
- BT-Augmented Datasets: ~3,502 samples for the largest combination.
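The degree of class imbalance in the original dataset follows directly from the counts above. The short calculation below is illustrative; the last figure assumes, hypothetically, that oversampling is applied to the full original dataset until the classes are equal, whereas in practice it would normally be applied to a training split only.

```python
vishing = 695        # minority class, from the counts above
non_vishing = 2232   # majority class, from the counts above
total = vishing + non_vishing

minority_share = vishing / total          # fraction of samples that are vishing
imbalance_ratio = non_vishing / vishing   # majority samples per minority sample
synthetic_needed = non_vishing - vishing  # extra minority samples for a 1:1 balance

print(f"total samples:    {total}")
print(f"minority share:   {minority_share:.1%}")
print(f"imbalance ratio:  {imbalance_ratio:.2f} : 1")
print(f"synthetic needed: {synthetic_needed}")
```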
Applications
The dataset supports research in:
- Voice phishing detection using ML and DL.
- Natural language processing in low-resource languages.
- Cybersecurity applications focused on social engineering.
Licensing
- The dataset is released under the CC BY-NC-SA 4.0 license, which allows non-commercial use with attribution and requires derivative works to be shared under the same license.