Firewall Attack Detections and Extractions (FADE)
- Submitted by: Gavin Black
- Last updated: Thu, 03/13/2025 - 17:47
- DOI: 10.21227/018f-ka11
Abstract
Validating defenses to meet emerging cybersecurity challenges requires continuous updates to the datasets used for testing. In this paper, we introduce the Firewall Attack Detections and Extractions (FADE) dataset, designed to address gaps in available collections by generating a diverse and balanced corpus of over 10 million categorized attacks derived from open-source rule sets and public penetration testing repositories. The FADE samples not only provide a wide variety of attacks across eight common categories but also combine these with realistic network traffic to create 50 million total entries, offering a balanced mix of labeled benign and malicious traffic. The methodology for dataset creation, along with the algorithms used for payload injection, is detailed to enhance reproducibility for future attack inclusions. We also provide an exploratory data analysis that demonstrates the characteristics of the dataset, including similarities between attack token frequencies and embedding spaces, underscoring the challenges and considerations necessary for developing effective defensive security tools. A classification performance baseline is established using multiple methods, highlighting the difficulty in crafting suitable predictive models. The FADE dataset, along with the relevant tools for data analysis, is being publicly released to foster research and development in network security.
The FADE dataset is available in both parquet and csv formats to readily accommodate common applications: the parquet form is optimized for high-performance data analysis tasks, while the csv format ensures the data is easily accessible across tools. Both formats contain identical content, split into files containing the combined requests, malicious payloads, and sets of benign request entities. Furthermore, a Python notebook is included that provides measures on the dataset.
The high-level list of included files, along with their descriptions, is provided below (Note: CSV files are forthcoming; only parquet is currently present):
• data/FADE_50M.[parquet/csv]: The primary dataset of fully formed requests. The requests are balanced between benign traffic and requests carrying payload injections, and contain the following columns:
- sample: Text contents of the request
- class: Overall unique combined class
- label: Whether the request is ‘malicious’ or ‘benign’
- benign_class: Type of nominal network traffic
- malicious_class: Type of injected attack if present
- malicious_source: Origin of the malicious payload
- payload_start_byte: Offset of injection start in sample
- payload_end_byte: Offset of injection end in sample
• data/injections.[parquet/csv]: Files containing the payloads in isolation, not included within requests. These samples are suitable for injection into network streams. Columns include text (the payload), class (attack type), and source (origin of the sample).
• code/corpus.py: A Python script for analyzing the dataset to provide a detailed breakdown of request types, including attacks, and the overall contents of the FADE dataset. Outputs:
- distribution_pie.png: The breakdown of benign vs. malicious samples
- sample_lengths.png: Average length of the sample strings in bytes, split by label
- injection_classes.png: Number of injection samples for each type of attack pattern present
• requirements.txt: Python library dependencies for pip
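The payload_start_byte and payload_end_byte columns described above can be used to recover the injected payload from a full request. The following is a minimal sketch, not part of the released code: the helper name extract_payload and the toy row are hypothetical, and it assumes that the offsets index the UTF-8 encoding of sample and that the end offset is exclusive. Both assumptions should be checked against the released data.

```python
def extract_payload(sample: str, start_byte: int, end_byte: int) -> str:
    """Return the injected payload located by byte offsets within a sample.

    Assumption: offsets index the UTF-8 bytes of `sample`, and `end_byte`
    is exclusive. Neither is confirmed by the dataset description.
    """
    raw = sample.encode("utf-8")
    return raw[start_byte:end_byte].decode("utf-8", errors="replace")


# Hypothetical request built from three parts so the offsets are computed,
# not guessed. The values are made up for illustration only.
prefix = "GET /search?q="
payload = "' OR 1=1 --"
suffix = " HTTP/1.1"

row = {
    "sample": prefix + payload + suffix,
    "label": "malicious",
    "payload_start_byte": len(prefix.encode("utf-8")),
    "payload_end_byte": len((prefix + payload).encode("utf-8")),
}

recovered = extract_payload(
    row["sample"], row["payload_start_byte"], row["payload_end_byte"]
)
print(recovered)
```

Slicing the encoded bytes rather than the string matters when a request contains multi-byte characters, since character and byte positions then differ.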
Acknowledgment: We extend our thanks to Leidos for the financial support of this research and allowing public release of the findings under approval number 25-LEIDOS-0303-29073.
Dataset Files
- FADE_50M.parquet (20.41 GB)
- injections.parquet (213.31 MB)
- corpus.py (1.81 kB)
- requirements.txt (33 bytes)