Standard Dataset
Graph dataset - LibTiff
- Submitted by: Mahmoud Zamani
- Last updated: Fri, 08/09/2024 - 11:33
- DOI: 10.21227/vdfg-qj56
Abstract
A comparative, empirical study of state-of-the-art contrastive and generative graph learning models applied to source and binary software fragments drawn from the National Vulnerability Database (NVD) reveals that Graph Masked Auto-Encoders show exceptional promise for detecting security vulnerabilities, outperforming all other baseline models in the study. This fills a key gap in the literature on automated and machine-assisted discovery and patching of software security vulnerabilities, which has become increasingly critical with the dramatic increase in modern software complexity, but for which Graph Neural Network (GNN) approaches are understudied relative to traditional processes, such as manual source code auditing and fuzzing.
To conduct the study, a novel dataset is first collected by extracting vulnerable code fragments from six applications with NVD-documented security flaws and converting this code to five different graph types using specialized tools based on code property graphs and binary semantics lifting. The resulting dataset is used in GNN-based analyses to determine which algorithm and graph type perform best, followed by an ablation study to determine which combination of parameters maximizes the effectiveness of the top-performing detector. The study is the first to train GNN models on a combination of source- and binary-level code features, which is important for helping cyber defenders craft source-level patches that defend against binary-level attacks.
The data is provided in a PyTorch Geometric data format for use with Graph Neural Networks.
README for the Vulnerability Detection dataset
=== Usage ===
This folder contains the following comma-separated text files
(replace DS with the name of the dataset):
n = total number of nodes
m = total number of edges
N = number of graphs
(1) DS_A.txt (m lines)
sparse (block diagonal) adjacency matrix for all graphs,
each line corresponds to (row, col) resp. (node_id, node_id)
(2) DS_graph_indicator.txt (n lines)
column vector of graph identifiers for all nodes of all graphs,
the value in the i-th line is the graph_id of the node with node_id i
(3) DS_graph_labels.txt (N lines)
class labels for all graphs in the dataset,
the value in the i-th line is the class label of the graph with graph_id i
(4) DS_node_labels.txt (n lines)
column vector of node labels,
the value in the i-th line corresponds to the node with node_id i
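As an illustration of how files (1) to (4) fit together, the following is a minimal Python sketch that splits the combined edge list into per-graph records. It is not part of the dataset tooling. The function names, the toy data, and the use of integer labels are assumptions made for this example, as is the convention that node_id i and graph_id i refer to the i-th line counted from 1.

```python
def parse_int_column(text):
    """Parse one integer per non-empty line."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def parse_edges(text):
    """Parse one "row, col" pair per non-empty line of DS_A.txt."""
    edges = []
    for line in text.splitlines():
        if line.strip():
            row, col = line.split(",")
            edges.append((int(row), int(col)))
    return edges


def group_by_graph(edges, graph_indicator, graph_labels, node_labels):
    """Split the block-diagonal edge list into one record per graph.

    Assumption: ids are line numbers counted from 1.
    """
    graphs = {}
    for graph_id, label in enumerate(graph_labels, start=1):
        graphs[graph_id] = {"label": label, "nodes": [], "node_labels": [], "edges": []}
    for node_id, graph_id in enumerate(graph_indicator, start=1):
        graphs[graph_id]["nodes"].append(node_id)
        graphs[graph_id]["node_labels"].append(node_labels[node_id - 1])
    for row, col in edges:
        graph_id = graph_indicator[row - 1]
        # Block-diagonal adjacency: both endpoints must belong to the same graph.
        if graph_indicator[col - 1] != graph_id:
            raise ValueError(f"edge ({row}, {col}) connects two different graphs")
        graphs[graph_id]["edges"].append((row, col))
    return graphs


# Toy contents standing in for the four required files (hypothetical data).
a_text = "1, 2\n2, 1\n3, 4\n4, 3\n4, 5\n5, 4\n"   # DS_A.txt, m = 6
indicator_text = "1\n1\n2\n2\n2\n"                 # DS_graph_indicator.txt, n = 5
graph_labels_text = "0\n1\n"                       # DS_graph_labels.txt, N = 2
node_labels_text = "3\n7\n3\n3\n5\n"               # DS_node_labels.txt, n = 5

graphs = group_by_graph(
    parse_edges(a_text),
    parse_int_column(indicator_text),
    parse_int_column(graph_labels_text),
    parse_int_column(node_labels_text),
)
for graph_id, record in graphs.items():
    print(graph_id, record["label"], record["nodes"], record["edges"])
```

To use the sketch on the real files, read each file's contents into a string first (for example with `open(path).read()`) and pass it to the matching parser.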
There are OPTIONAL files if the respective information is available:
(5) DS_edge_labels.txt (m lines; same size as DS_A.txt)
labels for the edges in DS_A.txt
(6) DS_edge_attributes.txt (m lines; same size as DS_A.txt)
attributes for the edges in DS_A.txt
(7) DS_node_attributes.txt (n lines)
matrix of node attributes,
the comma-separated values in the i-th line are the attribute vector of the node with node_id i
(8) DS_graph_attributes.txt (N lines)
regression values for all graphs in the dataset,
the value in the i-th line is the attribute of the graph with graph_id i
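The optional files follow the same line-count rules as the required ones, so a loader can validate them before use. The sketch below is an illustration only. It assumes the attribute values are numeric and parses them as floats. The helper names and the toy data are hypothetical.

```python
def parse_float_rows(text):
    """Parse comma-separated numeric values, one row per non-empty line."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append([float(value) for value in line.split(",")])
    return rows


def check_line_count(name, rows, expected):
    """Raise if a file does not have the line count stated in the README."""
    if len(rows) != expected:
        raise ValueError(f"{name}: expected {expected} lines, found {len(rows)}")
    return rows


# Toy contents for two optional files (hypothetical data): n = 3 nodes, N = 1 graph.
n, N = 3, 1
node_attr_text = "0.5, 1.0\n0.0, 2.5\n1.5, 0.0\n"  # DS_node_attributes.txt
graph_attr_text = "0.25\n"                         # DS_graph_attributes.txt

node_attributes = check_line_count(
    "DS_node_attributes.txt", parse_float_rows(node_attr_text), n
)
graph_attributes = check_line_count(
    "DS_graph_attributes.txt", parse_float_rows(graph_attr_text), N
)
print(node_attributes)
print(graph_attributes)
```

The same check applies to DS_edge_labels.txt and DS_edge_attributes.txt with m as the expected line count.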