Datasets
Standard Dataset
Security Patch Variant
- Submitted by: Lin Li
- Last updated: Thu, 10/17/2024 - 03:10
- DOI: 10.21227/5qwj-tb49
Abstract
Security patches play a crucial role in the battle
against Open Source Software (OSS) vulnerabilities. Meanwhile,
to facilitate the development of OSS projects, both upstream and
downstream developers often maintain multiple branches. Due
to the different code contexts among branches, multiple security
patch variants exist for the same vulnerability. Hence, to ease the
management of OSS vulnerabilities, locating all patch variants
of an OSS vulnerability is critically important. However, existing
works are mainly designed to locate one or several patches for a
vulnerability and cannot locate all of its patch variants.
In this paper, we study the problem of how to accurately locate
all variants of a given security patch. We motivate the problem
with a preliminary study, which shows that it is rather challenging
to locate all patch variants, even with a reference patch, due
to the diverse practice of OSS developers in backporting patches.
To overcome these challenges, we propose a new patch location
method to locate all variants of a patch in a code repository
(e.g., a piece of software or a specific version of it). Based on our
findings in the preliminary study, our method employs a rule-based model
and incorporates two-dimensional code commit features specifically
designed for the task of patch variant location: similarity features
and representative features. On a ground-truth dataset of patch
variants, our method achieves a precision of 99.68% and a recall of
98.81%, significantly outperforming two state-of-the-art baselines
(PATCHSCOUT and TRACER). Moreover, our method shows a strong capability
in locating patch variants in both upstream and downstream code
repositories.
# SPV Prototype

SPV is a tool for locating security patch variants. This repository releases the experimental data and source code.
### Environment Configuration
1. Please install the following software at the recommended versions:
   - Python 3.9
   - MySQL 8.0
   - git 2.25
2. Fill in the MySQL configuration and the root directory of *SPV* in `$WORKDIR$/spv/config.ini`:

   ```
   [root_path]
   root_path = $WORKDIR$/spv/

   [mysql]
   host = 10.176.xxx.xx
   port = 8888
   user = your_name
   pw = your_passwd
   db = your_database
   ```
3. Install Python dependencies as follows:

   ```
   $ cd $WORKDIR$/spv/
   $ pip install -r requirements.txt
   ```
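For reference, the `config.ini` layout from step 2 can be parsed with Python's standard `configparser`. This is only a sketch using the placeholder values from the example above, not code from *SPV* itself:

```python
import configparser

# A sample config.ini in the format shown above (placeholder values)
SAMPLE = """\
[root_path]
root_path = $WORKDIR$/spv/

[mysql]
host = 10.176.xxx.xx
port = 8888
user = your_name
pw = your_passwd
db = your_database
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # for a real file: config.read("config.ini")

root_path = config["root_path"]["root_path"]
port = config["mysql"].getint("port")  # getint() converts "8888" to 8888
```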
### Quick Start
#### 1. Data preparation

- **Reference patches**

  *SPV* needs a reference patch as input. Specifically, an Excel file with the header `['CVE', 'GitHub Repository', 'Reference Patch(es)']` is required. This file should contain the necessary information about the reference patches, which can be collected manually. An example is provided in `$WORKDIR$/spv/input/reference.xlsx`.
- **Local code repository**

  *SPV* locates the variants of the reference patch in a local git repository. It is recommended to save the git repositories under `$WORKDIR$/spv/repo/`. For example, to clone the *zulip* repository from the remote:

  ```
  $ cd $WORKDIR$/spv/repo/
  $ git clone https://github.com/zulip/zulip.git
  ```

  Note that the local repository must be saved with exactly the directory name specified in the `'GitHub Repository'` field of the reference information file.
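As a sanity check before running *SPV*, the expected schema of the reference file can be sketched in plain Python. The CVE and commit values below are hypothetical, and the `validate_reference_rows` helper is not part of *SPV*:

```python
# Required header of SPV's reference file, as described above
REQUIRED_COLUMNS = ["CVE", "GitHub Repository", "Reference Patch(es)"]

def validate_reference_rows(rows):
    """Check that each row dict carries exactly the fields SPV expects."""
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {row!r} is missing columns: {missing}")
    return True

# Hypothetical entry: the 'GitHub Repository' value must equal the
# directory name of the local clone under $WORKDIR$/spv/repo/
example = [{
    "CVE": "CVE-2017-0881",
    "GitHub Repository": "zulip",
    "Reference Patch(es)": "<commit hash of the reference patch>",
}]

validate_reference_rows(example)
```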
#### 2. Cache the repository information to the database.

*SPV* extracts the necessary information from the local git repositories and caches it in a database. To achieve this, prepare a list of the directory names of the local repositories, like `$WORKDIR$/spv/input/repo_list.json`:

```
[
    "zulip"
]
```

Then, cache them with the following commands:

```
$ cd $WORKDIR$/spv/src/
$ python spv.py -cache repo_list.json --commit --title --diff
```

By default, *SPV* will search for the repositories under `$WORKDIR$/spv/repo/`. You can change this directory by modifying `repodir` in the `[repository]` section of `config.ini`.
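Since the repository list is plain JSON, it can also be generated with Python's `json` module. A minimal sketch (it writes to the current directory rather than `$WORKDIR$/spv/input/`):

```python
import json

# Directory names of the local clones under $WORKDIR$/spv/repo/
repos = ["zulip"]

with open("repo_list.json", "w") as f:
    json.dump(repos, f, indent=4)
```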
#### 3. Locate variants.

*SPV* locates variants based on the cached information in the database. To do so, prepare a list of CVEs to be predicted, like `$WORKDIR$/spv/input/cve_list.json`:

```
[
    "CVE-2017-0881",
    "CVE-2020-14194",
    "CVE-2021-30477"
]
```
Run *SPV* to locate the patches:

```
$ cd $WORKDIR$/spv/src/
$ python spv.py -predict cve_list.json
```
By default, *SPV* will look for the reference patch information in `$WORKDIR$/spv/input/reference.xlsx`. You can use `-infofile` to specify another file, like:

```
$ cd $WORKDIR$/spv/src/
$ python spv.py -predict cve_list.json -infofile new_reference.xlsx
```
#### 4. Results

By default, *SPV* saves the results under `$WORKDIR$/spv/results/` with the name `predict-{date}.json`.
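The result file is JSON, so it can be inspected with a few lines of Python. A sketch, where the `load_latest_results` helper is hypothetical and the internal structure of the result file is not documented here:

```python
import glob
import json
import os

def load_latest_results(results_dir):
    """Return the newest predict-{date}.json in results_dir, or None."""
    pattern = os.path.join(results_dir, "predict-*.json")
    candidates = sorted(glob.glob(pattern), key=os.path.getmtime)
    if not candidates:
        return None
    with open(candidates[-1]) as f:
        return json.load(f)

# e.g. the default output directory $WORKDIR$/spv/results/
results = load_latest_results("results")
```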
### Experiment reproduction
#### 1. Data
Files in *data* directory provide necessary data to reproduce the experiments of our paper.
The dataset we construct for Research Questions 1~3 includes:

- `reference.xlsx` contains the reference patches of 737 CVEs.
- `ground_truth.xlsx` contains the patch variants of 737 CVEs.
- `rq1_cve.json` contains CVEs used for RQ1.
- `rq1_shuffled.xlsx` is composed of 10 epochs. Each epoch has exactly the same CVEs as `rq1_cve.json` but in a different order.
- `rq2_cve.json` contains CVEs used for RQ2.
- `rq3_cve.json` contains CVEs used for RQ3.
The newly collected dataset from NVD for RQ4 includes:

- `rq4_cve.json` contains 432 CVEs used for RQ4.
- `rq4-reference.xlsx` contains the reference patches of the 432 CVEs.
- `rq4-checked_cve.json` contains 45 CVEs for which we manually collected ground truth.
- `rq4-ground_truth.xlsx` contains the manually collected patch variants of the 45 CVEs.
#### 2. Reproduction
**Research Question 1**
Run the command:

```
$ cd $WORKDIR$/spv/src/
$ python exp/rq1_exp.py
```
**Research Question 2 & 3**
Follow the steps in *Quick Start* and use the corresponding files (`$WORKDIR$/spv/data/reference.xlsx`), but add `--exp` for the experiments. `--exp` is used to accommodate the range of affected branches in the dataset.
For example, run RQ2 by:

```
$ cd $WORKDIR$/spv/src/
$ python spv.py -predict ../data/rq2_cve.json -infofile ../data/reference.xlsx --exp
```
**Research Question 4**
Follow the steps in *Quick Start* and use the corresponding files (`$WORKDIR$/spv/data/rq4-reference.xlsx`).
#### 3. Training (optional)
1. Load `training_info.sql` into your MySQL database; it contains the pair set for training.
2. Add `train=True` to the call `main(infofile, shuffled_file)` at the end of `exp/rq1_exp.py` and run:

   ```
   $ cd $WORKDIR$/spv/src/
   $ python exp/rq1_exp.py
   ```