CTGAN Enhanced Dataset for UAV Network Intrusion Detection

Citation Author(s):
Qingli Zeng
Submitted by:
Qingli Zeng
Last updated:
Mon, 10/14/2024 - 18:18
DOI:
10.21227/v9nr-dk16
License:

Abstract 

Intrusion detection in Unmanned Aerial Vehicle (UAV) networks is crucial for maintaining the security and integrity of autonomous operations. However, the effectiveness of intrusion detection systems (IDS) is often compromised by the scarcity and imbalance of available datasets, which limits the ability to train accurate and reliable machine learning models. To address these challenges, we present the "CTGAN-Enhanced Dataset for UAV Network Intrusion Detection", a meticulously curated and augmented dataset designed to improve the performance of IDS in UAV environments.

 

Before addressing class imbalance, we cleaned the data by removing incomplete, inconsistent, or irrelevant entries, including records with null, NaN, or infinite values. This preprocessing step ensured the integrity and quality of the dataset. We then merged attack categories with similar features to simplify classification and make the labels more consistent.
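The cleaning step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes each record is a dictionary of numeric features, and the names `is_valid_value`, `clean_records`, and the sample field names are hypothetical. A record is dropped if any numeric field is missing, NaN, or infinite.

```python
import math


def is_valid_value(value):
    """Return True if value parses as a finite number (not None, NaN, or +/-inf)."""
    if value is None:
        return False
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        # Empty strings and non-numeric text are treated as invalid.
        return False


def clean_records(records, numeric_fields):
    """Keep only records whose numeric fields are all present and finite."""
    return [
        row for row in records
        if all(is_valid_value(row.get(field)) for field in numeric_fields)
    ]


# Hypothetical flow records; the field names are illustrative only.
raw = [
    {"flow_duration": 120.0, "bytes_per_s": 3500.0, "label": "Normal"},
    {"flow_duration": 15.0, "bytes_per_s": float("inf"), "label": "DoS"},
    {"flow_duration": None, "bytes_per_s": 80.0, "label": "Probe"},
    {"flow_duration": 42.0, "bytes_per_s": float("nan"), "label": "Normal"},
]
cleaned = clean_records(raw, ["flow_duration", "bytes_per_s"])
print(len(cleaned), "of", len(raw), "records kept")
```

For data held in a pandas DataFrame, a roughly equivalent step would be `df.replace([np.inf, -np.inf], np.nan).dropna()`.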

 

Leveraging Conditional Tabular Generative Adversarial Networks (CTGAN), we augmented the dataset by generating synthetic samples that closely replicate the distribution of the original data. CTGAN effectively captures the underlying patterns and relationships within the data, producing high-quality synthetic instances that enhance both the quantity and diversity of the dataset. This augmentation significantly mitigates the issue of class imbalance, providing a more balanced representation of various intrusion types and enabling the training of more robust and generalizable IDS models.

Instructions: 

 

Instructions for Using the "CTGAN-Enhanced Dataset for UAV Network Intrusion Detection"

 

1. Dataset Overview:

   - The dataset is derived from two well-known datasets, CIC-IDS2017 and UNSW-NB15, which are frequently used in intrusion detection research.

   - The data underwent extensive preprocessing to improve its quality. We removed incomplete, inconsistent, or irrelevant entries, including null values, NaNs, and infinite values.

   - Attack classes with similar characteristics were merged to reduce redundancy. Here is a breakdown of the new class labels and their respective sample sizes:

 

   CIC-IDS2017:

   - Normal: 2,271,320

   - DoS: 379,737 (Combined from DoS Hulk, DDoS, DoS GoldenEye, DoS Slowloris, DoS Slowhttptest)

   - Probe: 158,804

   - Botnet: 1,956

   - Infiltration: 36

   - Web Attack: 2,180 (Includes Web Attack: Brute Force, Web Attack: XSS, Web Attack: SQL Injection)

   - Brute Force: 13,832 (Includes FTP-Patator, SSH-Patator)

   - Heartbleed: 11

 

   UNSW-NB15:

   - Normal: 93,000

   - DoS: 16,353

   - Exploits: 127,642 (Includes Exploits, Generic, Fuzzers)

   - Reconnaissance: 16,664 (Merged from Reconnaissance, Analysis)

   - Shellcode: 1,511

   - Worms: 174

   - Backdoor: 2,329
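The class merging listed above amounts to a lookup from raw labels to merged labels. The sketch below is one way to express it, assuming the raw label spellings match the names in the breakdown (the strings in the source files may differ) and that labels not mentioned in a merge rule are kept unchanged. The names `CIC_IDS2017_MERGE`, `UNSW_NB15_MERGE`, and `merge_label` are hypothetical.

```python
from collections import Counter

# Merge rules taken from the class breakdown above.
# Raw label spellings are assumptions; adjust them to the actual files.
CIC_IDS2017_MERGE = {
    "DoS Hulk": "DoS",
    "DDoS": "DoS",
    "DoS GoldenEye": "DoS",
    "DoS Slowloris": "DoS",
    "DoS Slowhttptest": "DoS",
    "Web Attack: Brute Force": "Web Attack",
    "Web Attack: XSS": "Web Attack",
    "Web Attack: SQL Injection": "Web Attack",
    "FTP-Patator": "Brute Force",
    "SSH-Patator": "Brute Force",
}

UNSW_NB15_MERGE = {
    "Generic": "Exploits",
    "Fuzzers": "Exploits",
    "Analysis": "Reconnaissance",
}


def merge_label(raw_label, merge_map):
    """Map a raw label to its merged class; unlisted labels pass through."""
    label = raw_label.strip()
    return merge_map.get(label, label)


# Tiny illustrative sample, not real dataset rows.
sample = ["DoS Hulk", "DDoS", "SSH-Patator", "Heartbleed", " Web Attack: XSS "]
merged_counts = Counter(merge_label(label, CIC_IDS2017_MERGE) for label in sample)
print(merged_counts)
```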

 

2. Data Augmentation with CTGAN:

   - We used Conditional Tabular Generative Adversarial Networks (CTGAN) to augment the dataset, addressing class imbalance and data scarcity.

   - To maintain the quality of the generated synthetic data, we applied L2 regularization and WGAN-GP (Wasserstein GAN with Gradient Penalty) to limit overfitting and mode collapse.

   - Below is a comparison of per-class sample counts in the original and augmented datasets. Each augmented class received 5,000 synthetic samples:

 

   CIC-IDS2017 (Original vs Augmented):

   - Normal: 2,271,320 (no augmentation)

   - DoS: 379,737 (no augmentation)

   - Probe: 158,804 (no augmentation)

   - Botnet: 1,956 (augmented to 6,956)

   - Infiltration: 36 (augmented to 5,036)

   - Web Attack: 2,180 (augmented to 7,180)

   - Brute Force: 13,832 (no augmentation)

   - Heartbleed: 11 (augmented to 5,011)

 

   UNSW-NB15 (Original vs Augmented):

   - Normal: 93,000 (no augmentation)

   - DoS: 16,353 (no augmentation)

   - Exploits: 127,642 (no augmentation)

   - Reconnaissance: 16,664 (no augmentation)

   - Shellcode: 1,511 (augmented to 6,511)

   - Worms: 174 (augmented to 5,174)

   - Backdoor: 2,329 (augmented to 7,329)
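The effect of the augmentation can be checked with simple arithmetic on the counts listed above. The sketch below recomputes how many synthetic samples each augmented class received and compares the imbalance ratio (largest class divided by smallest class) before and after. The counts come from the lists above; the function names and the choice of imbalance ratio as a summary metric are illustrative assumptions.

```python
# Original per-class counts, copied from the lists above.
ORIGINAL = {
    "CIC-IDS2017": {
        "Normal": 2_271_320, "DoS": 379_737, "Probe": 158_804,
        "Botnet": 1_956, "Infiltration": 36, "Web Attack": 2_180,
        "Brute Force": 13_832, "Heartbleed": 11,
    },
    "UNSW-NB15": {
        "Normal": 93_000, "DoS": 16_353, "Exploits": 127_642,
        "Reconnaissance": 16_664, "Shellcode": 1_511, "Worms": 174,
        "Backdoor": 2_329,
    },
}

# Post-augmentation totals, only for the classes that were augmented.
AUGMENTED_TOTALS = {
    "CIC-IDS2017": {"Botnet": 6_956, "Infiltration": 5_036,
                    "Web Attack": 7_180, "Heartbleed": 5_011},
    "UNSW-NB15": {"Shellcode": 6_511, "Worms": 5_174, "Backdoor": 7_329},
}


def augmented_counts(source):
    """Full per-class counts after augmentation for one source dataset."""
    counts = dict(ORIGINAL[source])
    counts.update(AUGMENTED_TOTALS[source])
    return counts


def synthetic_added(source):
    """Number of synthetic samples added to each augmented class."""
    return {name: total - ORIGINAL[source][name]
            for name, total in AUGMENTED_TOTALS[source].items()}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size."""
    return max(counts.values()) / min(counts.values())


for source in ORIGINAL:
    before = imbalance_ratio(ORIGINAL[source])
    after = imbalance_ratio(augmented_counts(source))
    print(f"{source}: added={synthetic_added(source)}")
    print(f"  imbalance ratio {before:,.1f} -> {after:,.1f}")
```

The ratio drops sharply in both sources, but the majority classes still dominate, so the augmented data is more balanced rather than fully balanced.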

 

3. How to Use the Dataset:

   - The dataset contains CTGAN-augmented data and can be used to train and test machine learning models for UAV network intrusion detection.

   - When using the augmented dataset, pay attention to the total number of samples for each class as shown above. This matters most for heavily augmented classes, where the majority of samples are synthetic (for example, 5,000 of the 5,011 Heartbleed samples).
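One way to act on this advice is to compute, per class, the share of synthetic samples and a class weight for training. The sketch below does this for the CIC-IDS2017 counts. It uses the common "balanced" inverse-frequency heuristic, weight = total / (number of classes × class count), which is an assumption for illustration and not a weighting scheme prescribed by the dataset authors; the function names are hypothetical.

```python
# (original, augmented) sample counts for CIC-IDS2017, from the lists above.
CIC_COUNTS = {
    "Normal": (2_271_320, 2_271_320),
    "DoS": (379_737, 379_737),
    "Probe": (158_804, 158_804),
    "Botnet": (1_956, 6_956),
    "Infiltration": (36, 5_036),
    "Web Attack": (2_180, 7_180),
    "Brute Force": (13_832, 13_832),
    "Heartbleed": (11, 5_011),
}


def synthetic_fraction(original, augmented):
    """Share of a class's samples that were generated rather than observed."""
    return (augmented - original) / augmented


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


augmented = {name: aug for name, (_, aug) in CIC_COUNTS.items()}
weights = balanced_class_weights(augmented)

for name, (orig, aug) in CIC_COUNTS.items():
    share = synthetic_fraction(orig, aug)
    print(f"{name:13s} synthetic={share:6.1%} weight={weights[name]:.3f}")
```

Classes with a very high synthetic share are worth evaluating separately on original samples only, since test scores on synthetic samples may not reflect performance on real traffic.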

 

4. Technical Details:

   - The dataset files are in standard formats compatible with machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn.

   - The documentation files provide detailed descriptions of the feature columns, data structures, and the specific preprocessing steps applied.
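As a rough starting point, the sketch below loads labeled records and makes a stratified train/test split so that rare classes appear in both parts. It rests on assumptions the page does not confirm: that the data is available as CSV, that all non-label columns are numeric, and that the label column is named `label`. The function names and the in-memory sample data are hypothetical; consult the documentation files for the actual file names and columns.

```python
import csv
import io
import random
from collections import defaultdict


def load_labeled_csv(handle, label_column="label"):
    """Read a CSV stream into (features, labels); feature values become floats."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(row.pop(label_column))
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices per class so each class keeps a similar share in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Classes with a single sample stay entirely in the training part.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical in-memory CSV standing in for a real dataset file.
CSV_TEXT = "duration,bytes,label\n" + "".join(
    f"{i}.0,{100 * i},{'Normal' if i % 2 == 0 else 'Worms'}\n" for i in range(10)
)
features, labels = load_labeled_csv(io.StringIO(CSV_TEXT))
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open(path, newline="")`.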

 

By following these instructions, users can develop and evaluate machine learning models for UAV network intrusion detection with this dataset. For further assistance, please refer to the provided documentation or contact the dataset authors.