TrafficLLM Dataset
- Submitted by: Tianyu Cui
- Last updated: Tue, 04/01/2025 - 07:47
- DOI: 10.21227/6s59-fx73
Abstract
We released TrafficLLM's training datasets, which contain over 0.4M traffic samples and 9K human instructions for adapting LLMs to different traffic analysis tasks.
- Instruction Datasets: The instruction datasets help the LLM learn the domain knowledge of traffic detection and generation tasks and understand which task to perform in each scenario.
- Traffic Datasets: The traffic datasets contain the tuning data we extracted from public traffic datasets, which helps the LLM learn the traffic patterns of different downstream tasks.
Instruction Datasets
To build the natural language corpus of human instructions for TrafficLLM, we collected 9,209 task-specific instructions supervised by experts and AI assistants. The statistics are as follows:
| Mainstream Tasks | Downstream Tasks | Abbrev. | #Sample |
| ------------------ | ---------------------------- | ------- | ------- |
| Traffic Detection | Malware Traffic Detection | MTD | 1.0K |
| | Botnet Detection | BND | 1.1K |
| | Malicious DoH Detection | MDD | 0.6K |
| | Web Attack Detection | WAD | 0.6K |
| | APT Attack Detection | AAD | 0.6K |
| | Encrypted VPN Detection | EVD | 1.2K |
| | Tor Behavior Detection | TBD | 0.6K |
| | Encrypted App Classification | EAC | 0.6K |
| | Website Fingerprinting | WF | 0.6K |
| | Concept Drift | CD | 0.6K |
| Traffic Generation | Malware Traffic Generation | MTG | 0.6K |
| | Botnet Traffic Generation | BTG | 0.1K |
| | Encrypted VPN Generation | EVG | 0.4K |
| | Encrypted App Generation | EAG | 0.6K |
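As a quick consistency check, the rounded per-task counts in the table can be summed per mainstream task and compared with the reported total of 9,209 instructions. The snippet below is only a sketch: the dictionary and function names are illustrative, and the counts are the rounded table values (for example, 0.6K is entered as 600), so the sum is approximate.

```python
# Rounded per-task instruction counts copied from the table above
# (0.6K -> 600). These are approximations, not exact sample counts.
INSTRUCTION_COUNTS = {
    "Traffic Detection": {
        "MTD": 1000, "BND": 1100, "MDD": 600, "WAD": 600, "AAD": 600,
        "EVD": 1200, "TBD": 600, "EAC": 600, "WF": 600, "CD": 600,
    },
    "Traffic Generation": {
        "MTG": 600, "BTG": 100, "EVG": 400, "EAG": 600,
    },
}


def totals_by_group(counts):
    """Sum the per-task counts within each mainstream task group."""
    return {group: sum(tasks.values()) for group, tasks in counts.items()}


group_totals = totals_by_group(INSTRUCTION_COUNTS)
grand_total = sum(group_totals.values())

print(group_totals)  # {'Traffic Detection': 7500, 'Traffic Generation': 1700}
print(grand_total)   # 9200
```

The rounded total of 9,200 is consistent with the 9,209 instructions stated above; the small gap comes from rounding each row to the nearest 0.1K.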
Traffic Datasets
To evaluate the performance of TrafficLLM in various network scenarios, we extracted over 0.4M tuning samples from publicly available traffic datasets to measure TrafficLLM’s ability to detect or generate malicious and benign traffic. The statistics are as follows:
| Datasets | Tasks | Abbrev. | #Sample |
| ---------------- | ---------------------------- | ------- | ------- |
| USTC TFC 2016 | Malware Traffic Detection | MTD | 50.7K |
| ISCX Botnet 2014 | Botnet Detection | BND | 25.0K |
| DoHBrw 2020 | Malicious DoH Detection | MDD | 47.8K |
| CSIC 2010 | Web Attack Detection | WAD | 34.5K |
| DAPT 2020 | APT Attack Detection | AAD | 10.0K |
| ISCX VPN 2016 | Encrypted VPN Detection | EVD | 64.8K |
| ISCX Tor 2016 | Tor Behavior Detection | TBD | 40.0K |
| CSTNET 2023 | Encrypted App Classification | EAC | 97.6K |
| CW-100 2018 | Website Fingerprinting | WF | 7.4K |
| APP-53 2023 | Concept Drift | CD | 109.8K |
Dataset Files
- TrafficLLM Datasets: datasets.zip (313.62 MB)
- TrafficLLM Codes: TrafficLLM.zip (350.95 MB)
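The internal layout of the archives is not described here, so a reasonable first step after downloading is to list their contents before extracting anything. The following is a minimal sketch using only Python's standard `zipfile` module; the helper name `summarize_archive` and the local path `datasets.zip` are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def summarize_archive(archive_path):
    """Return (name, uncompressed_size_in_bytes) for each file in a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()  # skip directory entries
        ]


if __name__ == "__main__":
    # Assumed location: the archive was downloaded into the current directory.
    path = Path("datasets.zip")
    if path.exists():
        for name, size in summarize_archive(path):
            print(f"{size:>12,d}  {name}")
    else:
        print(f"{path} not found; download it first")
```

Listing the entries first shows which files belong to the instruction datasets and which to the traffic datasets without unpacking the full archive.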