FormAI Dataset: A Large Collection of AI-Generated C Programs and Their Vulnerability Classifications
- Submitted by: Norbert Tihanyi
- Last updated: Tue, 09/26/2023 - 05:10
- DOI: 10.21227/vp9n-wv96
Abstract
FormAI is a novel AI-generated dataset comprising 112,000 compilable and independent C programs. All programs were generated by GPT-3.5-turbo using a dynamic zero-shot prompting technique, and they vary in complexity: some handle complicated tasks such as network management, table games, or encryption, while others deal with simpler tasks like string manipulation. Each program is labeled according to the vulnerabilities present in its code, using a formal verification method based on the Efficient SMT-based Bounded Model Checker (ESBMC). This strategy conclusively identifies vulnerabilities: it reports no false positives, because every reported vulnerability is backed by a counterexample, and no false negatives up to a certain bound. Because the labels include the exact program location of each software vulnerability, the labeled samples can be used to train Large Language Models (LLMs).
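The phrase "up to a certain bound" can be illustrated with a toy sketch. The Python snippet below is not how ESBMC works internally (ESBMC encodes the program for an SMT solver rather than simulating it). It only mimics the effect of a loop unwinding bound on a C-style loop that writes past the end of a buffer. The function name and all parameters are hypothetical and chosen for illustration.

```python
from typing import Optional


def first_out_of_bounds_write(
    buffer_size: int, loop_count: int, unwind_bound: int
) -> Optional[int]:
    """Toy stand-in for a bounded check (hypothetical helper, not ESBMC).

    Simulates the C loop `for (i = 0; i < loop_count; i++) buf[i] = 0;`
    for at most `unwind_bound` iterations. Returns the first index that
    falls outside the buffer, or None if no violation occurs within the
    bound.
    """
    for i in range(min(loop_count, unwind_bound)):
        if i >= buffer_size:
            # A concrete counterexample: this exact write is out of bounds,
            # so the report cannot be a false positive.
            return i
    return None


# A 10-element buffer written by a loop that runs 12 times.
shallow = first_out_of_bounds_write(10, 12, unwind_bound=5)
deep = first_out_of_bounds_write(10, 12, unwind_bound=12)

print(shallow)  # None: the bug lies beyond the bound, so it is not seen
print(deep)     # 10: the first out-of-bounds index, found once the bound is large enough
```

With a bound of 5 the faulty write at index 10 is never reached, which is why the absence of a reported vulnerability only holds within the bound that was checked.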
Dataset Files
- FormAI dataset: Vulnerability Classification (No C source code included) FormAI_dataset_human_readable-V1.csv (15.95 MB)
- FormAI dataset: 112,000 compilable AI-generated C code FormAI_dataset_C_samples-V1.zip (97.61 MB)
- FormAI dataset: Vulnerability Classification (C source code included in CSV) FormAI_dataset_classification-V1.zip (60.66 MB)
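As a minimal sketch of how the classification CSV might be explored, the snippet below tallies how often each label occurs. The column names `file_name` and `vulnerability_type` and the sample rows are invented for illustration and are not taken from the dataset; inspect the real header before adapting it.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column):
    """Count how often each value appears in `label_column`.

    `csv_file` is any open text file (or file-like object) with a header row.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Invented sample rows, NOT real FormAI data; the column names are assumptions.
sample = io.StringIO(
    "file_name,vulnerability_type\n"
    "example_1.c,buffer overflow\n"
    "example_2.c,none\n"
    "example_3.c,buffer overflow\n"
)

counts = count_labels(sample, "vulnerability_type")
for label, n in counts.most_common():
    print(label, n)
```

For the real file, one would open `FormAI_dataset_human_readable-V1.csv` with `open(path, newline="")` and pass whichever column actually holds the vulnerability label.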