TrafficLLM Dataset

Citation Author(s):
Tianyu Cui
Zhongguancun Laboratory
Submitted by:
Tianyu Cui
Last updated:
Tue, 04/01/2025 - 07:47
DOI:
10.21227/6s59-fx73

Abstract 

We have released TrafficLLM's training datasets, which contain over 0.4M traffic samples and 9K human instructions for adapting LLMs to different traffic analysis tasks.

  • Instruction Datasets: These datasets help the LLM learn the domain knowledge of traffic detection and generation tasks and understand which task to perform in each scenario.
  • Traffic Datasets: These datasets contain the traffic tuning data we extracted from public traffic datasets, which help the LLM learn the traffic patterns of different downstream tasks.
Instructions: 

Instruction Datasets


To build the natural language corpus of human instructions for TrafficLLM, we collected 9,209 task-specific instructions supervised by experts and AI assistants. The statistics are as follows:

| Mainstream Tasks   | Downstream Tasks             | Abbrev. | #Sample |
| ------------------ | ---------------------------- | ------- | ------- |
| Traffic Detection  | Malware Traffic Detection    | MTD     | 1.0K    |
|                    | Botnet Detection             | BND     | 1.1K    |
|                    | Malicious DoH Detection      | MDD     | 0.6K    |
|                    | Web Attack Detection         | WAD     | 0.6K    |
|                    | APT Attack Detection         | AAD     | 0.6K    |
|                    | Encrypted VPN Detection      | EVD     | 1.2K    |
|                    | Tor Behavior Detection       | TBD     | 0.6K    |
|                    | Encrypted App Classification | EAC     | 0.6K    |
|                    | Website Fingerprinting       | WF      | 0.6K    |
|                    | Concept Drift                | CD      | 0.6K    |
| Traffic Generation | Malware Traffic Generation   | MTG     | 0.6K    |
|                    | Botnet Traffic Generation    | BTG     | 0.1K    |
|                    | Encrypted VPN Generation     | EVG     | 0.4K    |
|                    | Encrypted App Generation     | EAG     | 0.6K    |
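
As a quick sanity check, the per-task counts in the table above can be summed to see how they relate to the stated 9,209 instructions. The sketch below is only illustrative: the dictionary and function names are hypothetical, and the values are the rounded "K" figures from the table, so the totals are approximate.

```python
# Rounded per-task instruction counts, in thousands, copied from the table above.
# The names used here are illustrative and not part of the TrafficLLM release.
INSTRUCTION_COUNTS_K = {
    "Traffic Detection": {
        "MTD": 1.0, "BND": 1.1, "MDD": 0.6, "WAD": 0.6, "AAD": 0.6,
        "EVD": 1.2, "TBD": 0.6, "EAC": 0.6, "WF": 0.6, "CD": 0.6,
    },
    "Traffic Generation": {
        "MTG": 0.6, "BTG": 0.1, "EVG": 0.4, "EAG": 0.6,
    },
}


def subtotal_k(mainstream_task: str) -> float:
    """Sum the rounded counts (in thousands) for one mainstream task."""
    return round(sum(INSTRUCTION_COUNTS_K[mainstream_task].values()), 1)


def total_k() -> float:
    """Sum the rounded counts (in thousands) over all downstream tasks."""
    return round(sum(subtotal_k(task) for task in INSTRUCTION_COUNTS_K), 1)


if __name__ == "__main__":
    for task in INSTRUCTION_COUNTS_K:
        print(f"{task}: ~{subtotal_k(task)}K instructions")
    print(f"All tasks: ~{total_k()}K instructions")
```

With the rounded figures, detection tasks account for roughly 7.5K instructions and generation tasks for roughly 1.7K, which is consistent with the 9,209 total reported above.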

    

Traffic Datasets


To evaluate TrafficLLM's performance in various network scenarios, we extracted over 0.4M tuning samples from publicly available traffic datasets to measure its ability to detect or generate malicious and benign traffic. The statistics are as follows:

| Datasets         | Tasks                        | Abbrev. | #Sample |
| ---------------- | ---------------------------- | ------- | ------- |
| USTC TFC 2016    | Malware Traffic Detection    | MTD     | 50.7K   |
| ISCX Botnet 2014 | Botnet Detection             | BND     | 25.0K   |
| DoHBrw 2020      | Malicious DoH Detection      | MDD     | 47.8K   |
| CSIC 2010        | Web Attack Detection         | WAD     | 34.5K   |
| DAPT 2020        | APT Attack Detection         | AAD     | 10.0K   |
| ISCX VPN 2016    | Encrypted VPN Detection      | EVD     | 64.8K   |
| ISCX Tor 2016    | Tor Behavior Detection       | TBD     | 40.0K   |
| CSTNET 2023      | Encrypted App Classification | EAC     | 97.6K   |
| CW-100 2018      | Website Fingerprinting       | WF      | 7.4K    |
| APP-53 2023      | Concept Drift                | CD      | 109.8K  |
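
The traffic dataset sizes listed above differ by more than an order of magnitude, which may matter when sampling or weighting tasks during tuning. The following sketch computes the approximate total and each dataset's share from the rounded table values. The variable and function names are hypothetical and are not taken from the TrafficLLM code.

```python
# Rounded per-dataset sample counts, in thousands, copied from the table above.
# The names used here are illustrative and not part of the TrafficLLM release.
TRAFFIC_COUNTS_K = {
    "USTC TFC 2016": 50.7,
    "ISCX Botnet 2014": 25.0,
    "DoHBrw 2020": 47.8,
    "CSIC 2010": 34.5,
    "DAPT 2020": 10.0,
    "ISCX VPN 2016": 64.8,
    "ISCX Tor 2016": 40.0,
    "CSTNET 2023": 97.6,
    "CW-100 2018": 7.4,
    "APP-53 2023": 109.8,
}


def total_traffic_k() -> float:
    """Approximate total number of traffic samples, in thousands."""
    return round(sum(TRAFFIC_COUNTS_K.values()), 1)


def share_percent(dataset: str) -> float:
    """Approximate share of one dataset in the total, as a percentage."""
    return round(100 * TRAFFIC_COUNTS_K[dataset] / sum(TRAFFIC_COUNTS_K.values()), 1)


if __name__ == "__main__":
    print(f"Total: ~{total_traffic_k()}K samples")
    # List datasets from largest to smallest with their share of the total.
    for name in sorted(TRAFFIC_COUNTS_K, key=TRAFFIC_COUNTS_K.get, reverse=True):
        print(f"{name}: {TRAFFIC_COUNTS_K[name]}K ({share_percent(name)}%)")
```

Summing the rounded figures gives roughly 487.6K samples, in line with the "over 0.4M" stated above. APP-53 2023 is the largest source and CW-100 2018 the smallest.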