Data and tools of the paper "Demystifying and Assessing Code Understandability in Java Decompilation"

Citation Author(s):
Ruixin Qin
Submitted by:
Ruixin Qin
Last updated:
Wed, 06/19/2024 - 09:20
DOI:
10.21227/r91g-ka53
License:
0

Abstract 

This dataset accompanies the paper "Demystifying and Assessing Code Understandability in Java Decompilation" and supports the analysis of code understandability in Java decompilation. The data is organized into two main directories, data/original and data/testset, which hold the original dataset and the test set, respectively. Each directory has three components. First, the code directory contains the experimental data: source code and the corresponding decompiled code produced by three decompilers, CFR, Fernflower, and Jadx. Second, the results directory holds the results of the metric calculations used in the study. Third, the annotated dataset files record the relative understandability of each decompiled file compared with its original source file. The dataset can be used to evaluate and compare how well different metrics assess code understandability in the context of decompilation.

Instructions: 

Data

The `data/` directory contains two subdirectories, `data/original` and `data/testset`, which hold the original dataset and the test set. Each subdirectory has three parts:
1. Experimental data in the directory `code`: source code and the corresponding code decompiled by CFR, Fernflower and Jadx.
2. Calculation results in the directory `results`.
3. The annotated dataset `relative_understandability_<original/testset>.csv`, which records the understandability of each decompiled file relative to the original file: -1 means the decompiled file is less understandable than the original, 0 means the two are equivalent, and 1 means the decompiled file is more understandable.
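
The label encoding described in item 3 can be tallied with a few lines of Python. This is only a minimal sketch: the column name `label` and the sample rows are hypothetical, because the actual header of `relative_understandability_<original/testset>.csv` is not described here. Inspect the file and adjust `label_column` accordingly.

```python
import csv
import io
from collections import Counter

# Meaning of each label, as described in the dataset instructions.
LABEL_MEANING = {-1: "less understandable", 0: "equivalent", 1: "more understandable"}

def tally_labels(csv_file, label_column):
    """Count how often each relative-understandability label occurs.

    csv_file is any open text file object; label_column is the name of the
    column holding -1/0/1 (an assumption: check the real header first).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        value = int(row[label_column])
        if value not in LABEL_MEANING:
            raise ValueError(f"unexpected label {value!r}")
        counts[LABEL_MEANING[value]] += 1
    return counts

# Hypothetical sample rows; the real files may use different column names.
sample = io.StringIO(
    "file,decompiler,label\n"
    "A.java,CFR,-1\n"
    "A.java,Fernflower,0\n"
    "A.java,Jadx,-1\n"
    "B.java,CFR,1\n"
)
counts = tally_labels(sample, "label")
```

To use it on the real data, pass an open file handle instead of the in-memory sample.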

Tools

The `tool/` directory contains tools for assessing the understandability of decompiled code with three metrics: perplexity, Cognitive Complexity, and Cognitive Complexity for Decompilation.

Environment
- System: Ubuntu 20.04
- Python: 3.10, with the `kenlm` and `javalang` packages:
> pip install kenlm
> pip install javalang
- Java: JDK >= 11

Perplexity Calculator

`perplexity_calculator.py` calculates the perplexity of a Java file under an n-gram language model.
`5-gram.binary` is our 5-gram language model.

Usage:
> python perplexity_calculator.py <5-gram.binary> <file>
Here <5-gram.binary> is the path to the n-gram model and <file> is the Java file to be evaluated.
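
For readers unfamiliar with the metric, the sketch below shows how perplexity follows from per-token log probabilities. It is not the code of `perplexity_calculator.py`. It assumes base-10 log probabilities (the convention KenLM uses for its scores), and the input values are made up for illustration.

```python
def perplexity(log10_probs):
    """Perplexity of a token sequence from per-token log10 probabilities.

    perplexity = 10 ** (-(1/N) * sum(log10 p_i)), where N is the number of
    scored tokens. Lower values mean the model finds the code more predictable.
    """
    if not log10_probs:
        raise ValueError("need at least one token")
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Hypothetical scores: four tokens, each with probability 0.1.
uniform = perplexity([-1.0, -1.0, -1.0, -1.0])

# Hypothetical scores: a less predictable sequence gives a higher perplexity.
surprising = perplexity([-1.0, -3.0, -2.0, -2.0])
```

Under this definition, a file that the language model finds harder to predict receives a higher perplexity value.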

Cognitive Complexity Calculator and Cognitive Complexity for Decompilation Calculator

`CognitiveComplexityCalculator-1.0.jar` calculates the Cognitive Complexity for Java files.
`CognitiveComplexityforDecompilationCalculator-1.0.jar` calculates the Cognitive Complexity for Decompilation for Java files.

Usage:
> java -jar CognitiveComplexityCalculator-1.0.jar <input_directory> <output_file>
> java -jar CognitiveComplexityforDecCalculator-1.0.jar <input_directory> <output_file>
Here <input_directory> is the directory containing the Java files to analyze (files in subdirectories are included), and <output_file> is the path at which the output file is created.

The output file is a .csv file which contains the Cognitive Complexity or Cognitive Complexity for Decompilation value for each method. Specifically it contains:
- Absolute Module Path: The path of the class containing the method
- Module Position: The line in the .java file where the method starts
- Module declaration: The method signature and return type or pattern type
- Max Nesting: The maximum level of nesting reached by the method (the starting level counts as 1)
- Cognitive Complexity or Cognitive Complexity for Decompilation
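
Because the output holds one row per method, a per-file score needs an aggregation step. The sketch below sums the method-level values for each class path. It assumes the .csv file is comma-delimited and has a header row whose names match the list above (`Absolute Module Path`, `Cognitive Complexity`); both are assumptions to verify against an actual output file. The sample rows are invented, and summing is only one possible aggregation, not necessarily the one used in the paper.

```python
import csv
import io
from collections import defaultdict

def complexity_per_file(csv_file, path_column="Absolute Module Path",
                        value_column="Cognitive Complexity"):
    """Sum per-method complexity values for each class path.

    The default column names are assumptions taken from the field list;
    pass different names if the real header differs.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row[path_column]] += int(row[value_column])
    return dict(totals)

# Invented sample rows in the assumed layout.
sample = io.StringIO(
    "Absolute Module Path,Module Position,Module declaration,Max Nesting,Cognitive Complexity\n"
    "/tmp/src/A.java,10,void run(),2,3\n"
    "/tmp/src/A.java,25,int size(),1,0\n"
    "/tmp/src/B.java,7,void close(),3,5\n"
)
totals = complexity_per_file(sample)
```

For the Cognitive Complexity for Decompilation output, pass the corresponding column name as `value_column`.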

Reference
1. Cognitive Complexity Calculator: https://github.com/BruhZul/cognitive-complexity-calculator