SpringTC - an executable text-code dataset
- Citation Author(s):
- M. Kacmajor and J.D. Kelleher
- Submitted by:
- Magdalena Kacmajor
- Last updated:
- Wed, 11/13/2024 - 17:38
- DOI:
- 10.21227/c40w-rh94
Abstract
M. Kacmajor and J.D. Kelleher, "ExTra: Evaluation of Automatically Generated Source Code Using Execution Traces" (submitted to IEEE TSE)
In this paper we propose ExTra, a novel approach to evaluating code quality based on comparing the execution traces of the generated code and the ground-truth code. ExTra captures the behaviour of the programs implemented with the generated code, taking into account all internal and external dependencies. In contrast to source-code-based metrics, ExTra is semantically meaningful; and in contrast to evaluation approaches that measure the functional correctness of code, ExTra is suitable for evaluating code developed in the context of real-life software systems.
The first contribution of this paper is the design, implementation, and validation of ExTra. The value of ExTra is examined via experiments in which our metric and three source-code-based metrics (BLEU, Levenshtein distance, and CodeBLEU) are applied to two types of automatically generated source code: test code and production code. The results show that the scores produced by the three source-code-based metrics are highly correlated, while ExTra is clearly distinct. The qualitative analysis of the differences reveals a number of examples in which ExTra scores are semantically more adequate than scores computed from token comparison. Furthermore, the quantitative analysis of the agreement between the evaluation scores and test verdicts (produced by generated test cases, or by test cases applied to the generated code) shows that ExTra is a much better predictor of "failed" verdicts than any of the three text-oriented metrics. On the whole, our results indicate that ExTra adds value to the process of assessing the quality of generated code, and we recommend it as an evaluation tool complementary to source-code-based methods.
The second contribution of this paper is a set of three new evaluation datasets that contain executable code extracted from large, active GitHub repositories and can be used for evaluating model performance with ExTra, or for other tasks that require executable code.
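To make the idea of trace-based evaluation concrete, the sketch below compares two hypothetical execution traces, modelled as sequences of called methods, using a generic sequence-similarity measure. This is an illustration only; it is not the actual ExTra metric, whose definition is given in the paper, and the trace format and method names are assumptions.

```python
from difflib import SequenceMatcher

def trace_similarity(trace_ref, trace_gen):
    """Illustrative similarity between two execution traces,
    modelled here as sequences of fully qualified method calls.
    NOTE: this is NOT the ExTra metric from the paper, just a
    sketch of comparing runtime traces instead of source tokens."""
    return SequenceMatcher(None, trace_ref, trace_gen).ratio()

# Hypothetical traces recorded while running the ground-truth
# test case and the generated test case against the same project.
ref = ["Foo.init", "Foo.load", "Bar.check", "Foo.close"]
gen = ["Foo.init", "Foo.load", "Foo.close"]

print(trace_similarity(ref, gen))  # ~0.857
```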
SpringTC is a dataset for code generation from natural language descriptions, extracted from spring-framework (https://github.com/spring-projects/spring-framework), a large software repository available on GitHub under Apache License 2.0. SpringTC contains test code and comprises a training set (12,929 examples) and an executable test set (1,220 examples).
train.json: training set containing code examples paired with natural language descriptions. Fields: "code", "nl"
test_exec.json: test set containing indexed code examples paired with natural language descriptions. Fields: "code", "nl", "idx"
metadata.json: contains the information needed to execute the individual examples from the test set in the context of the parent project. The metadata includes the path to the source file containing the ground-truth test case (to enable substituting the ground-truth code with the generated code), and the fully qualified name of that test case (to enable its individual execution within the context of the software project). Fields: "testcaseFullname", "classComment", "thrownExceptions", "annotations", "className", "title", "modifiers", "body", "testCasesPerClass", "allContainedComments", "classModifiers", "containsStrings", "id", "packageName", "classOrphantComments", "containedStrings", "format", "methodName", "classMembers", "classNameNL", "comment", "classImports", "javadocComment", "parameters", "classJavadocComment", "origin_url", "filepath", "executable", "parent_project", "testcase_fullname", "idx".
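A minimal sketch of loading the files and joining test examples with their execution metadata via the shared "idx" field (field names are taken from the descriptions above; each file is assumed to hold a single JSON array):

```python
import json

def load_json(path):
    # Assumes each file holds one JSON array of objects;
    # if the files are JSON Lines instead, parse line by line.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

train = load_json("train.json")        # [{"code": ..., "nl": ...}, ...]
test = load_json("test_exec.json")     # [{"code": ..., "nl": ..., "idx": ...}, ...]
metadata = load_json("metadata.json")  # per-example execution info

# Join each test example with its execution metadata.
meta_by_idx = {m["idx"]: m for m in metadata}
for example in test:
    meta = meta_by_idx[example["idx"]]
    print(example["idx"], meta["testcase_fullname"], meta["filepath"])
```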
Dataset Files
- train.json (5.47 MB)
- test_exec.json (562.28 kB)
- metadata.json (4.73 MB)
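The metadata is intended to support running each generated test case inside a checkout of the parent project. The sketch below outlines that workflow under stated assumptions: the verbatim body substitution and the Gradle invocation are illustrative guesses based on the field descriptions above (spring-framework builds with Gradle), not a documented harness shipped with the dataset.

```python
import subprocess
from pathlib import Path

def run_generated_test(project_root, meta, generated_code):
    """Sketch: substitute the ground-truth test body with generated
    code, then run that single test case via Gradle. Replacing the
    ground-truth body verbatim is an assumption; a robust harness
    would rewrite the source file's AST instead."""
    src = Path(project_root) / meta["filepath"]
    original = src.read_text(encoding="utf-8")
    src.write_text(original.replace(meta["body"], generated_code),
                   encoding="utf-8")
    try:
        # Select the single test case by its fully qualified name;
        # for spring-framework the relevant Gradle module may need
        # to be specified as well (e.g. :spring-core:test).
        result = subprocess.run(
            ["./gradlew", "test", "--tests", meta["testcase_fullname"]],
            cwd=project_root, capture_output=True, text=True)
        return result.returncode == 0  # True -> verdict "passed"
    finally:
        src.write_text(original, encoding="utf-8")  # restore ground truth
```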