Abstract

The increasing prevalence of encrypted traffic in modern networks poses significant challenges for network security, particularly in detecting and classifying malicious activities and application signatures. Deep learning has emerged as a promising approach to this problem owing to its ability to learn complex data patterns. In this work, we present a novel and robust deep learning-based framework for encrypted traffic analysis (ETA) that combines Bidirectional Encoder Representations from Transformers (BERT) with Long Short-Term Memory (LSTM) networks. The framework uses the LSTM's ability to capture long-term dependencies in sequential data to model the temporal patterns of network packets, while BERT contributes an understanding of the bidirectional context within packet sequences. The LSTM thus enables effective anomaly detection and prediction of future packets, and BERT provides a deeper contextual understanding of the traffic flow. We evaluate the framework on the publicly available ISCXVPN2016 and CSTNET datasets, where it outperforms existing work with an accuracy (AC) of 99.65%, a precision (PR) of 99.53%, and a recall (RC) of 99.28%. The framework efficiently detects Over-the-Top (OTT) application signatures within encrypted traffic streams, supporting comprehensive network monitoring and enhanced security measures without compromising the integrity of packets.

Instructions:

Description of processing PCAP files to generate dataset

We recommend cleaning the PCAP data first. Because the program's processing logic is not streamlined, the data pre-processing steps for pre-training and fine-tuning are detailed as follows.

Pre-training Stage

Main Program: dataset_generation.py

Functions: pretrain_dataset_generation, get_burst_feature

Initialization.
Set the variable pcap_path (line:616) to the directory of the PCAP data to be processed.
Set the variables word_dir (line:23) and word_name (line:24) to specify where the pre-training dataset is stored.
Pre-process PCAP.
Set the variables output_split_path (line:583) and pcap_output_path (line:584).
pcap_output_path is the directory that stores PCAP data converted from pcapng format to pcap format.
output_split_path is the directory that stores PCAP data sliced into sessions.
Generate Pre-training Datasets.
Once PCAP processing is complete, the program generates a pre-training dataset composed of BURST.
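
The instructions above do not define BURST. As a rough illustration only, the sketch below assumes the definition common in encrypted-traffic work: a burst is a maximal run of consecutive packets travelling in the same direction within one session. The function name split_into_bursts and the simplified (source, payload) packet tuples are hypothetical; this is not the code in get_burst_feature.

```python
from typing import List, Tuple

# Hypothetical simplified packet: (source address, payload bytes).
Packet = Tuple[str, bytes]


def split_into_bursts(packets: List[Packet]) -> List[List[bytes]]:
    """Group a session's packets into runs that share the same source.

    Assumption: a burst ends whenever the sending side changes.
    """
    bursts: List[List[bytes]] = []
    current: List[bytes] = []
    current_src = None
    for src, payload in packets:
        # A change of sender closes the burst collected so far.
        if current and src != current_src:
            bursts.append(current)
            current = []
        current.append(payload)
        current_src = src
    if current:
        bursts.append(current)
    return bursts
```

For a session in which the client sends two packets, the server replies once, and the client sends again, this yields three bursts holding 2, 1, and 1 payloads.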

Fine-tuning Stage

Main Program: main.py

Functions: data_preprocess.py, dataset_generation.py, open_dataset_deal.py, dataset_cleanning.py

The key idea of the fine-tuning stage, when processing public PCAP datasets, is to first separate the data into one folder per label, then slice the data into sessions, and finally generate packet-level or flow-level datasets as needed.
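
The first step, telling labels apart by folder, might look like the sketch below. It assumes one sub-directory per label directly under the dataset root; the function name label_ids and the choice to number labels in alphabetical order are illustrative and are not taken from main.py.

```python
import os
from typing import Dict


def label_ids(dataset_root: str) -> Dict[str, int]:
    """Map each label folder under dataset_root to an integer class id.

    Assumption: every immediate sub-directory is one label; loose files
    in the root are ignored.
    """
    labels = sorted(
        entry.name for entry in os.scandir(dataset_root) if entry.is_dir()
    )
    return {label: index for index, label in enumerate(labels)}
```

Sorting the folder names keeps the label-to-id mapping stable across runs, which matters if the dataset is regenerated after training has started.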

Note: Raw PCAP data can be complex and irregular, so if the code reports an error, we recommend working through the following steps to check its execution.

Initialization.
pcap_path, dataset_save_path, samples, features, and dataset_level (line:28) are the basic variables. They represent the original data directory, the directory where the generated data is stored, the number of samples, the feature type, and the data level, respectively. open_dataset_not_pcap (line:215) controls the conversion of PCAP data to pcap format, e.g. pcapng to pcap.
file2dir (line:226) controls the generation of category directories to store PCAP data when each pcap file is its own category.
Pre-process.
Data pre-processing primarily splits the PCAP data in the directory into session data.
Set the splitcap_finish parameter to 0 to initialize the sample number array. The samples value set at this point should not exceed the minimum number of samples.
Then set splitcap=True (line:54) and run the code to split the PCAP data. The split sessions are saved in pcap_path\\splitcap.
Generation.
After data pre-processing is complete, change the variables to generate the fine-tuning data. Set pcap_path to the path of the split data and set splitcap=False. The number of samples is now no longer restricted by the minimum sample size. open_dataset_not_pcap and file2dir should be False. The dataset for fine-tuning is then generated and saved in dataset_save_path.
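
To choose a safe value for samples, it can help to count how many session files each label has. The sketch below assumes the split output holds one sub-directory per label, each containing .pcap session files; that layout and the helper names session_counts and capped_samples are assumptions for illustration, not part of main.py.

```python
import os
from typing import Dict


def session_counts(split_root: str) -> Dict[str, int]:
    """Count .pcap session files in each label folder under split_root."""
    counts: Dict[str, int] = {}
    for entry in os.scandir(split_root):
        if entry.is_dir():
            counts[entry.name] = sum(
                1
                for item in os.scandir(entry.path)
                if item.is_file() and item.name.endswith(".pcap")
            )
    return counts


def capped_samples(requested: int, counts: Dict[str, int]) -> int:
    """Limit the requested sample count to the smallest class size."""
    if not counts:
        return 0
    return min(requested, min(counts.values()))
```

Calling capped_samples(5000, session_counts(path_to_split_data)) gives a value that does not exceed the smallest class, where path_to_split_data is a placeholder for your own split directory.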

Dataset Files

Files have not been uploaded for this dataset
