
Turbulence is a new benchmark and automated testing framework based on the question neighbourhood approach for systematically evaluating the accuracy (the overall rate of correctness across all generated outputs), correctness potential (whether the LLM produces at least one correct output for a given input), and consistent correctness (the model’s ability to consistently produce correct outputs for the same input across successive generations) of instruction-tuned large language models (LLMs) for code generation.
- Categories: