FormAI Dataset: A Large Collection of AI-Generated C Programs and Their Vulnerability Classifications

Citation Author(s):
Norbert Tihanyi, Technology Innovation Institute
Tamas Bisztray, University of Oslo
Ridhi Jain, Technology Innovation Institute
Mohamed Amine Ferrag, Technology Innovation Institute
Lucas C. Cordeiro, University of Manchester
Vasileios Mavroeidis, University of Oslo
Submitted by:
Norbert Tihanyi
Last updated:
Tue, 09/26/2023 - 05:10
DOI:
10.21227/vp9n-wv96

Abstract 

FormAI is a novel AI-generated dataset comprising 112,000 compilable and independent C programs. All programs in the dataset were generated by GPT-3.5-turbo using a dynamic zero-shot prompting technique, and they vary in complexity. Some programs handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Each program is labeled according to the vulnerabilities present in its code, using a formal verification method based on the Efficient SMT-based Bounded Model Checker (ESBMC). This strategy conclusively identifies vulnerabilities without reporting false positives (because each reported vulnerability is backed by a counterexample) or false negatives (up to a certain bound). Because the labels include the exact program location of each software vulnerability, the labeled samples can be used to train Large Language Models (LLMs).
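
To make the labeling idea more concrete, the following is a minimal hand-written sketch of the kind of defect a bounded model checker can report together with a counterexample. It is not taken from the dataset: the names `BUF_LEN`, `copy_unchecked`, and `copy_checked` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

#define BUF_LEN 8  /* hypothetical buffer size, for illustration only */

/*
 * Vulnerable variant: the loop bound is off by one, so any source string
 * of BUF_LEN or more characters causes a write past the end of dst.
 * Inputs shorter than BUF_LEN happen to be handled correctly.
 */
size_t copy_unchecked(char dst[BUF_LEN], const char *src) {
    size_t i = 0;
    while (src[i] != '\0' && i <= BUF_LEN) {  /* bug: bound is too large */
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';  /* out-of-bounds write when i >= BUF_LEN */
    return i;
}

/*
 * Repaired variant: copies at most BUF_LEN - 1 characters and always
 * leaves room for the terminating null byte.
 */
size_t copy_checked(char dst[BUF_LEN], const char *src) {
    size_t i = 0;
    while (i < BUF_LEN - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return i;
}
```

In this sketch, a source string of length `BUF_LEN` or more is a counterexample for `copy_unchecked`: it drives execution to an out-of-bounds write on `dst`. A label of the kind the abstract describes would presumably record both the vulnerability type and that program location, whereas `copy_checked` would yield no such counterexample for this property.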