Abstract

This page provides several real-world software vulnerability data sets.

The original real-world data sets were collected by Lin et al. (https://github.com/DanielLin1986/TransferRepresentationLearning). They contain the source code of vulnerable and non-vulnerable functions obtained from six real-world software projects, including FFmpeg, LibTIFF, LibPNG, VLC and Pidgin. These data sets cover both multimedia and image application categories.

To obtain the data sets provided here, we preprocess the original data sets before feeding them into the deep neural networks. First, we standardize the source code by removing comments, blank lines and non-ASCII characters. Second, we map user-defined variables to symbolic names (e.g., "var1", "var2") and user-defined functions to symbolic names (e.g., "func1", "func2"). We also replace integer, real and hexadecimal numbers with a generic "number" token and strings with a generic "str" token. We use Joern (https://joern.readthedocs.io/en/latest/) to analyze the source code and identify the user-defined variables and functions.

Instructions:

For the original real-world data sets, please refer to Lin et al. (2018), https://github.com/DanielLin1986/TransferRepresentationLearning. They contain the source code of vulnerable and non-vulnerable functions obtained from six real-world software projects, including FFmpeg, LibTIFF, LibPNG, VLC and Pidgin. These data sets cover both multimedia and image application categories.

The data sets provided here are the output of our data processing phase. First, we standardize the source code by removing comments, blank lines and non-ASCII characters. Second, we map user-defined variables to symbolic names (e.g., "var1", "var2") and user-defined functions to symbolic names (e.g., "func1", "func2"). We also replace integer, real and hexadecimal numbers with a generic "number" token and strings with a generic "str" token. We use Joern (https://joern.readthedocs.io/en/latest/) to analyze the source code and identify the user-defined variables and functions.
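The processing steps above can be illustrated with a minimal Python sketch. It is not the authors' script: the function name normalize_function, the regular expression, and the choice to leave character literals unchanged are assumptions made for illustration. The lists of user-defined variables and functions are passed in by hand here, whereas the original pipeline obtains them from Joern.

```python
import re

# One pass over the source: comments, string literals, character literals,
# numeric literals (hex, integer, real) and identifiers. The pattern is a
# simplification and does not cover every C literal form.
_TOKEN = re.compile(
    r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<char>'(?:\\.|[^'\\\n])*')
  | (?P<number>\b(?:0[xX][0-9a-fA-F]+|\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)[uUlLfF]*\b)
  | (?P<ident>\b[A-Za-z_]\w*\b)
    """,
    re.VERBOSE | re.DOTALL,
)


def normalize_function(source, variables=(), functions=()):
    """Standardize one function's source text (illustrative sketch).

    `variables` and `functions` are the user-defined names, assumed to
    come from a separate analysis step; they are numbered in list order.
    """
    var_names = {name: f"var{i}" for i, name in enumerate(variables, start=1)}
    func_names = {name: f"func{i}" for i, name in enumerate(functions, start=1)}

    def replace(match):
        kind, text = match.lastgroup, match.group(0)
        if kind == "comment":
            return " "          # keep neighbouring tokens separated
        if kind == "string":
            return "str"        # generic string token
        if kind == "number":
            return "number"     # generic number token
        if kind == "ident":
            return func_names.get(text, var_names.get(text, text))
        return text             # character literals are left unchanged

    # Drop non-ASCII characters, rewrite tokens, then drop blank lines.
    source = source.encode("ascii", errors="ignore").decode("ascii")
    source = _TOKEN.sub(replace, source)
    lines = (line.rstrip() for line in source.splitlines())
    return "\n".join(line for line in lines if line.strip())


example = '''
/* compute size */
int get_size(int width) {
    // caf\u00e9
    int total = width * 0x10 + 2;
    log_msg("size: %d", total);
    return total;
}
'''
normalized = normalize_function(
    example, variables=["width", "total"], functions=["get_size", "log_msg"]
)
print(normalized)
```

Handling comments, literals and identifiers in a single regular-expression pass avoids rewriting numbers or names that appear inside string literals or comments.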

In the experiments, some of the data sets from the multimedia category were used as the source domain, whilst other data sets from the image category were used as the target domain. For the training and testing phases, we split the data of the source domain into two random partitions: 80% for training and 20% for validation. We also split the data of the target domain into two random partitions: 80% for training, used without any label information, and 20% for testing the model. We additionally apply gradient clipping regularization to prevent over-fitting when training the models.
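The partitioning described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper split_80_20, the fixed seeds, and the placeholder (function_text, label) pairs are all assumptions.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of `samples` and return (first 80%, remaining 20%)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical placeholder data: (function_text, label) pairs.
source_domain = [(f"src_func_{i}", i % 2) for i in range(10)]
target_domain = [(f"tgt_func_{i}", i % 2) for i in range(10)]

# Source domain: labeled training and validation partitions.
source_train, source_val = split_80_20(source_domain, seed=1)

# Target domain: the 80% partition is used without its labels,
# the 20% partition keeps its labels for testing.
target_train, target_test = split_80_20(target_domain, seed=2)
target_train_unlabeled = [text for text, _label in target_train]
```

Only the target training partition has its labels discarded; the target test partition keeps them so the model can be evaluated.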

Please read the paper "Deep Domain Adaptation for Vulnerable Code Function Identification" (V. Nguyen et al. 2019, https://ieeexplore.ieee.org/abstract/document/8851923) for details about statistics of these data sets as well as the data sets used for the source and target domains.

Dataset Files

Full_Part1.zip (3.44 MB)
