Abstract

We release the TrafficLLM training datasets, which contain over 0.4M traffic samples and 9K human instructions for adapting LLMs to different traffic analysis tasks.

Instruction Datasets: The instruction datasets help the LLM learn the domain knowledge of traffic detection and generation tasks and understand which task to perform in a given scenario.
Traffic Datasets: The traffic datasets contain the tuning data we extracted from public traffic datasets, which help the LLM learn the traffic patterns of different downstream tasks.

Instructions:

Instruction Datasets

To build the natural language corpus of human instructions for TrafficLLM, we collected 9,209 task-specific instructions under the supervision of experts and AI assistants. The statistics are as follows:

| Mainstream Tasks   | Downstream Tasks             | Abbrev. | #Sample |
| ------------------ | ---------------------------- | ------- | ------- |
| Traffic Detection  | Malware Traffic Detection    | MTD     | 1.0K    |
|                    | Botnet Detection             | BND     | 1.1K    |
|                    | Malicious DoH Detection      | MDD     | 0.6K    |
|                    | Web Attack Detection         | WAD     | 0.6K    |
|                    | APT Attack Detection         | AAD     | 0.6K    |
|                    | Encrypted VPN Detection      | EVD     | 1.2K    |
|                    | Tor Behavior Detection       | TBD     | 0.6K    |
|                    | Encrypted App Classification | EAC     | 0.6K    |
|                    | Website Fingerprinting       | WF      | 0.6K    |
|                    | Concept Drift                | CD      | 0.6K    |
| Traffic Generation | Malware Traffic Generation   | MTG     | 0.6K    |
|                    | Botnet Traffic Generation    | BTG     | 0.1K    |
|                    | Encrypted VPN Generation     | EVG     | 0.4K    |
|                    | Encrypted App Generation     | EAG     | 0.6K    |

Traffic Datasets

To evaluate the performance of TrafficLLM in various network scenarios, we extracted over 0.4M tuning samples from publicly available traffic datasets to measure TrafficLLM's ability to detect or generate malicious and benign traffic. The statistics are as follows:

| Datasets         | Tasks                        | Abbrev. | #Sample |
| ---------------- | ---------------------------- | ------- | ------- |
| USTC TFC 2016    | Malware Traffic Detection    | MTD     | 50.7K   |
| ISCX Botnet 2014 | Botnet Detection             | BND     | 25.0K   |
| DoHBrw 2020      | Malicious DoH Detection      | MDD     | 47.8K   |
| CSIC 2010        | Web Attack Detection         | WAD     | 34.5K   |
| DAPT 2020        | APT Attack Detection         | AAD     | 10.0K   |
| ISCX VPN 2016    | Encrypted VPN Detection      | EVD     | 64.8K   |
| ISCX Tor 2016    | Tor Behavior Detection       | TBD     | 40.0K   |
| CSTNET 2023      | Encrypted App Classification | EAC     | 97.6K   |
| CW-100 2018      | Website Fingerprinting       | WF      | 7.4K    |
| APP-53 2023      | Concept Drift                | CD      | 109.8K  |
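
As a quick sanity check on the headline figures (9K instructions, over 0.4M traffic samples), the per-task counts in the two tables above can be tallied. This is a minimal sketch: the dictionaries are transcribed by hand from the tables, the values are already rounded to the nearest 0.1K, and the names `instruction_counts_k`, `traffic_counts_k`, and `total_k` are illustrative and not part of the TrafficLLM code.

```python
# Per-task sample counts in thousands, transcribed from the two tables.
# The table values are rounded, so the totals are approximate.
instruction_counts_k = {
    "MTD": 1.0, "BND": 1.1, "MDD": 0.6, "WAD": 0.6, "AAD": 0.6,
    "EVD": 1.2, "TBD": 0.6, "EAC": 0.6, "WF": 0.6, "CD": 0.6,
    "MTG": 0.6, "BTG": 0.1, "EVG": 0.4, "EAG": 0.6,
}
traffic_counts_k = {
    "MTD": 50.7, "BND": 25.0, "MDD": 47.8, "WAD": 34.5, "AAD": 10.0,
    "EVD": 64.8, "TBD": 40.0, "EAC": 97.6, "WF": 7.4, "CD": 109.8,
}


def total_k(counts):
    """Sum per-task counts (in thousands), rounded to one decimal place."""
    return round(sum(counts.values()), 1)


instruction_total_k = total_k(instruction_counts_k)
traffic_total_k = total_k(traffic_counts_k)

print(f"instructions:    ~{instruction_total_k}K")
print(f"traffic samples: ~{traffic_total_k}K")
```

The rounded rows add up to about 9.2K instructions, in line with the stated 9,209, and about 487.6K traffic samples, in line with "over 0.4M".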

Dataset Files

TrafficLLM Datasets: datasets.zip (313.62 MB)
TrafficLLM Codes: TrafficLLM.zip (350.95 MB)
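
The internal layout of the archives is not described here, so a reasonable first step after downloading is to list what they contain before extracting anything. The following is a minimal sketch using only Python's standard `zipfile` module; the helper name `summarize_archive` is hypothetical, and nothing is assumed about the folder or file names inside `datasets.zip`.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Return (file_count, Counter of files per top-level entry) for a zip file.

    Only the archive index is read; nothing is extracted to disk.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so only real files are counted.
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    # Group files by the first path component (top-level folder or file name).
    top_level = Counter(PurePosixPath(name).parts[0] for name in names)
    return len(names), top_level


# Example usage, assuming datasets.zip was downloaded to the working directory:
#   file_count, top_level = summarize_archive("datasets.zip")
#   print(file_count, dict(top_level))
```

Reading only the index keeps the check fast even for archives of several hundred megabytes.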
