Turbulence-Benchmark-v2

- Submitted by: Shahin Honarvar
- DOI: 10.21227/pmmv-nt11
Abstract
Turbulence is a new benchmark and automated testing framework based on the question neighbourhood approach for systematically evaluating the accuracy (the overall rate of correctness across all generated outputs), correctness potential (whether the LLM produces at least one correct output for a given input), and consistent correctness (the model’s ability to consistently produce correct outputs for the same input across successive generations) of instruction-tuned large language models (LLMs) for code generation.
Turbulence consists of a large set of natural language question templates—each a parameterised programming problem that can be instantiated in many different forms. Each template is paired with a test oracle that determines whether a code solution returned by an LLM is correct. This allows the generation of a neighbourhood of closely related questions from a single template, enabling fine-grained assessment of model behaviour across similar tasks.
Turbulence systematically identifies cases where an LLM can solve some instances within a neighbourhood but fails to generalise across the entire set. By employing accuracy, correctness potential, and consistent correctness as core metrics, Turbulence provides a structured methodology to reveal model weaknesses and offers a more nuanced characterisation of LLM behaviour in structured problem spaces.
This version (v2) is a direct update to version 1: it adds two new metrics and rigorous statistical analysis to the source code.
Instructions:
To execute the benchmark, please ensure that your operating system is either Linux or macOS, and that Python version 3.10 or higher is installed. Follow the steps below in the specified order:
Step 1: Install Required Libraries
Open the terminal and use the cd command to move into the Turbulence folder. Then run:
pip install -r requirements.txt
Step 2: Set API Keys as Environment Variables
Configure the API keys as environment variables on your operating system. Depending on the API and the platform you use to access the LLM, you may need to modify the content of run_llm.py.
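For example, on Linux or macOS a key can be exported in the shell before running the benchmark. The variable name `OPENAI_API_KEY` below is illustrative; the exact name depends on your provider and on how run_llm.py reads credentials:

```shell
# Illustrative only: the variable name your provider (and run_llm.py) expects may differ.
export OPENAI_API_KEY="your-key-here"
# To persist it across sessions, append the export line to ~/.bashrc or ~/.zshrc.
```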
Step 3: Define Prerequisite Settings
Download the Source_Code folder. In the config.json file, define the prerequisite settings as outlined below. The keys must remain unchanged, but their values can be adjusted based on your preferences.
- task:
Set this to "run_llm" when generating responses from the LLM, or "test" when testing previously generated responses.
- model_specifications:
Define the LLM’s name, maximum generation length (`max_tokens`), and temperature.
- seed:
Sets the random seed for reproducibility.
- questions:
Specify which question templates to use:
- A single number (e.g., 34) for one question template.
- A range (e.g., 24-45) for consecutive question templates.
- A list (e.g., 1, 5, 57) for non-consecutive question templates. The order of numbers does not matter.
- A combination (e.g., 1, 5, 57, 25-45) is also allowed.
- To use all templates, write 1-60.
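The accepted forms above can be expanded with a few lines of Python. This is a sketch of the semantics only, not Turbulence's actual parser, and the function name is hypothetical:

```python
def parse_questions(spec) -> list:
    """Expand a questions value such as "34", "24-45", or "1, 5, 57, 25-45"
    into a sorted list of unique template IDs (input order does not matter)."""
    ids = set()
    for part in str(spec).split(","):
        part = part.strip()
        if "-" in part:                      # inclusive range, e.g. "25-45"
            lo, hi = part.split("-")
            ids.update(range(int(lo), int(hi) + 1))
        elif part:                           # single number, e.g. "34"
            ids.add(int(part))
    return sorted(ids)
```

For instance, `parse_questions("1, 5, 57, 25-45")` yields 1, 5, 57, and every ID from 25 through 45.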
- number_of_parameters:
Sets the number of instances per neighbourhood.
- shuffle:
If "True", shuffles question instances within each neighbourhood before sending them to the LLM.
- fuzzy_testing:
If "True", runs additional fuzz testing after the fixed test suites. If "False", skips fuzz testing to reduce evaluation cost.
- number_of_fuzzy_inputs:
The number of random inputs to generate for the fuzz-testing phase.
- terminates_with_first_error:
If "True", stops the testing campaign upon encountering the first failure.
- allowed_time and allowed_memory:
Set the time (in seconds) and memory (in GB) limits for executing each test. These settings help control evaluation cost and prevent the testing process from stalling due to overly complex or resource-intensive outputs generated by the LLM.
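One common way to enforce such limits on Linux or macOS is to run each candidate solution in a subprocess with a wall-clock timeout and, where supported, an address-space cap via `resource.setrlimit`. This is a sketch of the idea, not Turbulence's actual test runner:

```python
import resource
import subprocess

def run_limited(cmd, allowed_time=10, allowed_memory_gb=None):
    """Run cmd with a time limit and (optionally) a memory cap.
    Returns the CompletedProcess, or None if the time limit was exceeded.
    Note: RLIMIT_AS enforcement varies by OS (it is ineffective on macOS)."""
    preexec = None
    if allowed_memory_gb is not None:
        cap = int(allowed_memory_gb * 1024 ** 3)
        preexec = lambda: resource.setrlimit(resource.RLIMIT_AS, (cap, cap))
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              timeout=allowed_time, preexec_fn=preexec)
    except subprocess.TimeoutExpired:
        return None  # the test stalled; treat as a failure
```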
- number_of_rounds:
Number of times each instance is sent to the LLM.
- num_of_processes:
Number of processes to use during testing.
- confidence_level:
Statistical confidence level for analysis.
- path:
Absolute path to the Turbulence source code and config file.
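A config.json following the descriptions above might look like the sketch below. All values are illustrative, and the exact key nesting (in particular under model_specifications) may differ from the shipped config.json:

```json
{
  "task": "run_llm",
  "model_specifications": {
    "name": "gpt-4",
    "max_tokens": 1024,
    "temperature": 0.7
  },
  "seed": 42,
  "questions": "1, 5, 57, 25-45",
  "number_of_parameters": 20,
  "shuffle": "True",
  "fuzzy_testing": "True",
  "number_of_fuzzy_inputs": 50,
  "terminates_with_first_error": "False",
  "allowed_time": 10,
  "allowed_memory": 2,
  "number_of_rounds": 5,
  "num_of_processes": 4,
  "confidence_level": 0.95,
  "path": "/absolute/path/to/Turbulence/Source_Code"
}
```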
Step 4: Execution
Make sure you are in the folder that contains main.py. Then run:
python3 main.py
Dataset Files
- Turbulence-Benchmark-v2-part1 (Size: 3.27 GB)
- Turbulence-Benchmark-v2-part2 (Size: 2.23 GB)
- Turbulence-Benchmark-v2-part3 (Size: 4.23 GB)
- Turbulence-Benchmark-v2-part4 (Size: 4.12 GB)
- Turbulence-Benchmark-v2-part5 (Size: 4.56 GB)
- Turbulence-Benchmark-v2-part6 (Size: 4.56 GB)
- Turbulence-Benchmark-v2-part7 (Size: 4.21 GB)
- Turbulence-Benchmark-v2-part8 (Size: 2.86 GB)
- Turbulence-Benchmark-v2-source-code (Size: 199.53 KB)