FormAI Dataset: A Large Collection of AI-Generated C Programs and Their Vulnerability Classifications

Citation Author(s):
Norbert
Tihanyi
Technology Innovation Institute
Tamas
Bisztray
University of Oslo
Ridhi
Jain
Technology Innovation Institute
Mohamed
Amine Ferrag
Technology Innovation Institute
Lucas
C. Cordeiro
University of Manchester
Vasileios
Mavroeidis
University of Oslo
Submitted by:
Norbert Tihanyi
Last updated:
Tue, 09/26/2023 - 05:10
DOI:
10.21227/vp9n-wv96
Data Format:
License:
0
0 ratings - Please login to submit your rating.

Abstract 

FormAI is a novel AI-generated dataset comprising 112,000 compilable and independent C programs. All the programs in the dataset were generated by GPT-3.5-turbo using dynamic zero-shot prompting technique and comprises programs with varying levels of complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Each program is labelled based on vulnerabilities present in the code using a formal verification method based on the Efficient SMT-based Bounded Model Checker (ESBMC). This strategy conclusively identifies vulnerabilities without reporting false positives (due to the presence of counter examples), or false negatives (up to a certain bound). The labeled samples can be utilized to train Large Language Models (LLMs) since they contain the exact program location of the software vulnerability.

Instructions: 
 
The dataset includes three files:
 
a. FormAI_dataset_C_samples-V1.zip - This file contains all the 112,000 C files.
 
b. FormAI_dataset_classification-V1.zip - This file contains a CSV file with the following six columns: 1. File name, 2. Vulnerability classification, 3. C source code, 4. Vulnerable function name, 5. Line number, 6. Vulnerability type. One C program can contain multiple vulnerabilities. As such this file has 246550 rows. In the fifth column, each C program preserves the structure of the original code by containing the newline character.
 
c. FormAI_dataset_human_readable-V1.csv - Human readable version, containing only 5 columns without the C code.
 
Dataset usage 
 
All C programs in this dataset were generated by GPT-3.5-turbo. From the entire dataset 109,757 (97.9%) sample files can be compiled using only the command gcc -lm.
However, the remaining 2.1% of samples are more complex, utilizing external libraries such as OpenSSL, sqlite3, and others. After installing the required dependencies, all files should compile without any issues.
 
The FormAI dataset is suitable for machine learning training, testing vulnerability detection softwares, and various tasks that require a substantial collection of C files. 
 
In this guide, we illustrate the process of classifying the samples in the FormAI dataset and how to extract the C source code from the categorized csv files. This serves as an ideal foundation for those looking to build upon these findings in future research.
 
1. Start by obtaining the entire FormAI dataset from Github with the given command:
 
 
 
2. The following sequence of commands includes an operating system upgrade and the installation of the build-essential package, a crucial requirement for the proper compilation of the dataset.
 
   sudo apt update && sudo apt upgrade -y && sudo apt install build-essential -y   
 
   
3. The subsequent step involves installing all the necessary dependencies to compile the 112,000 C files present in the dataset.
 
    sudo apt install -y build-essential csvtool unzip libsqlite3-dev libssl-dev libmysqlclient-dev libpq-dev libportaudio2 portaudio19-dev libpcap-dev libqrencode-dev libsdl2-dev freeglut3-dev libcurl4-openssl-dev
 
4. Unzip the .C samples from the compressed file (this action will generate a DATASET directory containing all the files):
 
 
     unzip FormAI-dataset/FormAI_dataset_C_samples-V1.zip && unzip FormAI-dataset/FormAI_dataset_classification-V1.zip
 
 
5. One can verify the successful installation of the dependencies by testing 1000 files from the dataset. This command will verify that the first 1000 files from the dataset compile successfully (this process usually takes around 2-3 minutes). 
 
 
     find DATASET -name "*.c" | head -n 1000 | xargs -I{} bash -c 'gcc {} -w -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm &>/dev/null &&  echo {}' | wc -l
 
 
     NOTE: If the output is not equal to 1000, it indicates that there may be missing dependencies or that certain programs fail to compile, particularly in earlier versions of GCC. This situation can occur in Ubuntu 20.04 LTS, where the default GCC version is 9.4.0, which is older than 11.4.0. However, it's important to note that this should not impact the usability of the DATASET.
   
6. If the result is 1000 from the previous run, it indicates that all the tested files were compiled without issues. Our system is ready to use the provided files. To retrieve and compile the C code from a specific row in the FormAI_dataset.csv, use the command below (taking row number 3679 as a example):
 
   
   gcc -x c <(csvtool drop 3679 FormAI_dataset.csv | csvtool -t ',' head 1 - | csvtool -t ',' col 3 - | sed 's/^"//; s/"$//' | sed 's/""/"/g') -lcrypto -pthread -lsqlite3 -lmysqlclient -lpq -lssl -lportaudio -lpcap -lqrencode -lSDL2 -lglut -lGLU -lGL -lcurl -lm -o output
   
 
7. Each program is categorized according to the vulnerabilities found in its code, using a formal verification technique that leverages the Efficient SMT-based Bounded Model Checker (ESBMC).
 
   ESBMC is a mature, permissively licensed open-source context-bounded model checker for verifying single- and multithreaded C/C++, Kotlin, and Solidity programs. It can automatically verify predefined safety properties (e.g., bounds check, pointer safety, overflow) and user-defined program assertions.
 
   The Github page for ESBMC can be found at the following link: https://github.com/esbmc/esbmc
   ESBMC depends on numerous external tools and has a variety of intricate dependencies, from SMT solvers to more complex ones. Nevertheless, there's an option to utilize a pre-compiled binary version, which can be obtained using the command provided below:
 
   
   wget https://github.com/esbmc/esbmc/releases/download/v7.3/ESBMC-Linux.zip && unzip ESBMC-Linux.zip && rm ESBMC-Linux.zip && chmod 777 bin/esbmc
   
8. If you wish to test ESBMC on an individual file from the FormAI dataset, use the command below (taking FormAI_14569.c as a reference):
 
   
   bin/esbmc DATASET/FormAI_14569.c  --overflow --memory-leak-check --timeout 30 --unwind 1 --multi-property --no-unwinding-assertions
 
 
   The file appears to be vulnerable, revealing a buffer overflow in the scanf() function. Below is the output from ESBMC:
 
    
      State 6 file FormAI_14569.c line 24 column 9 function main thread 0
      ----------------------------------------------------
      Violated property:
        file FormAI_14569.c line 24 column 9 function main
        buffer overflow on scanf
        0
   
      VERIFICATION FAILED
   
 
9. As a concluding action, you can confirm that this vulnerability is indeed present in the FormAI dataset by executing the command below:
 
   
   cat FormAI-dataset/FormAI_dataset_human_readable-V1.csv | grep 'FormAI_14569.c'
   
The outpus is the same as ESBMC found :  FormAI_14569.c,VULNERABLE,main,24.0,buffer overflow on scanf
 
 
 
Disclaimer 
 
   
SHA256 checksum of the files:
 
FormAI_dataset_C_samples-V1.zip : fc458020ad0e9b0999882d4bfdd27edfdee2f0ff5de22848e11352d64230ca47
FormAI_dataset_classification-V1.zip : 82b00611d71ff992d85b7bba20ebece580bfd055939b0142861326dff17ede74
FormAI_dataset_human_readable-V1.csv : 316d7145006f48ec42c9a89b1dccb2c5e2c05012c926af65c0269aba45d47830
 
WARNING: BE CAREFUL WHEN RUNNING THE COMPILED CODES, SOME CAN CONNECT TO THE WEB, SCAN YOUR LOCAL NETWORK, OR DELETE A RANDOM FILE FROM YOUR FILE SYSTEM. ALWAYS CHECK THE SOURCE CODE AND THE COMMENTS IN THE FILE BEFORE RUNNING IT!!!