Abstract

Decompilation, the process of translating machine-level code into human-readable source code, is critical in reverse engineering. Because its main purpose is to facilitate code comprehension when the original source code is unavailable, the understandability of decompiled code is important. However, existing research has mainly focused on the correctness of decompilation, with limited attention given to the understandability of decompiled code. Key questions remain unresolved: how much value decompiler stakeholders place on understandability, and which methodologies can be used to assess decompiled code.

This paper presents the first empirical study on the understandability of Java decompiled code. The study includes a user survey on the severity and prevalence of understandability issues in Java decompilation, along with a series of experiments that use existing understandability metrics to compare source files from 14 Java projects with their corresponding decompiled outputs generated by three Java decompilers. An in-depth analysis of the findings reveals that:
    (1) Understandability of Java decompiled code is regarded as being as critical as its correctness, with understandability issues occurring more frequently.
    (2) A notable percentage of Java decompiled code exhibits significantly lower or higher understandability compared to source code.
    (3) Cognitive Complexity demonstrates relatively acceptable precision but low recall in recognizing the understandability divergence during decompilation.
    (4) Other existing metrics demonstrate even lower precision and recall in recognizing such understandability divergence.

Inspired by these findings, we propose an enhanced metric specific to Java decompilation, extending Cognitive Complexity by incorporating six additional rules. Each rule addresses one of six code patterns identified in our prior study as frequently contributing to low understandability in decompiled code. Experimental results demonstrate the metric's high precision and recall in identifying such low understandability cases in Java decompilation.

This dataset provides all the experimental results reported in the paper and the corresponding artifacts needed to reproduce them.

Instructions:

# Demystifying and Assessing Code Understandability in Java Decompilation

Data and tools of the paper "Demystifying and Assessing Code Understandability in Java Decompilation".

## Data

The `data/` directory contains two subdirectories, `data/original` and `data/testset`, which hold the original dataset and the test set, respectively. Each subdirectory includes three parts:
1. Experimental data in the `code` directory: source code and the corresponding code decompiled by CFR, Fernflower, and Jadx.
2. Calculation results in the `results` directory.
3. The annotated dataset `relative_understandability_<original/testset>.csv`, which records the understandability of each decompiled file relative to its original file: -1 means the decompiled file is less understandable than the original, 0 means the two are equivalent, and 1 means the decompiled file is more understandable.
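
As a rough illustration of how the -1/0/1 labels could be summarized, the sketch below counts the labels in one column of the annotated CSV. The column layout of the file is not documented here, so the `label_column` argument is an assumption that must be set to whichever column actually holds the label; the helper names are hypothetical and not part of the released tools.

```python
import csv
from collections import Counter
from typing import Iterable

# Meaning of each label, as described in the list above.
LABEL_NAMES = {-1: "less understandable", 0: "equivalent", 1: "more understandable"}


def count_labels(rows: Iterable[dict], label_column: str) -> Counter:
    """Count how often each relative-understandability label occurs.

    ASSUMPTION: `label_column` names a column whose cells are -1, 0, or 1.
    """
    counts = Counter()
    for row in rows:
        counts[int(row[label_column])] += 1
    return counts


def count_labels_in_file(csv_path: str, label_column: str) -> Counter:
    """Read the annotated CSV (assumed to have a header row) and count labels."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return count_labels(csv.DictReader(handle), label_column)
```

Given a `Counter` returned by either function, `LABEL_NAMES[label]` gives a readable name for each key when printing the distribution.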

## Tools

The `tool/` directory contains tools for assessing the understandability of decompiled code with perplexity, Cognitive Complexity, and Cognitive Complexity for Decompilation.

### Environment

- System: Ubuntu 20.04
- Python: 3.10
- Java: JDK >= 11

Install the Python dependencies:

```sh
pip install kenlm
pip install javalang
```

### Perplexity Calculator

`perplexity_calculator.py` calculates the perplexity of n-gram models for a Java file.
`5-gram.binary` is our 5-gram language model.

```sh
python perplexity_calculator.py <5-gram.binary> <file>
```

Here, \<5-gram.binary\> is the path to the n-gram model and \<file\> is the Java file to be evaluated.
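
For readers unfamiliar with n-gram perplexity, the following is a minimal sketch of the computation, not the contents of `perplexity_calculator.py`. It assumes whitespace tokenization (the shipped script may tokenize differently, for example with `javalang`), and the function names are hypothetical. It relies only on `kenlm.Model` and its `score` method, which returns a log10 probability.

```python
def perplexity_from_log10(total_log10_prob: float, token_count: int) -> float:
    """Perplexity = 10 ** (-(total log10 probability) / number of scored tokens)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 10.0 ** (-total_log10_prob / token_count)


def file_perplexity(model_path: str, java_path: str) -> float:
    """Score one Java file with a KenLM model and return its perplexity."""
    import kenlm  # imported lazily so the helper above works without kenlm installed

    model = kenlm.Model(model_path)
    with open(java_path, encoding="utf-8") as handle:
        tokens = handle.read().split()  # ASSUMPTION: whitespace tokenization
    text = " ".join(tokens)
    # score() returns the log10 probability of the whole token sequence;
    # the +1 accounts for the end-of-sentence token added by eos=True.
    total_log10 = model.score(text, bos=True, eos=True)
    return perplexity_from_log10(total_log10, len(tokens) + 1)
```

Lower perplexity means the token sequence is more predictable under the language model.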

### Cognitive Complexity Calculator and Cognitive Complexity for Decompilation Calculator

`CognitiveComplexityCalculator-1.0.jar` calculates the Cognitive Complexity for Java files.
`CognitiveComplexityforDecompilationCalculator-1.0.jar` calculates the Cognitive Complexity for Decompilation for Java files.

```sh
java -jar CognitiveComplexityCalculator-1.0.jar <input_directory> <output_file>
java -jar CognitiveComplexityforDecompilationCalculator-1.0.jar <input_directory> <output_file>
```

Here, \<input_directory\> is the directory containing the Java files to analyze (files in subdirectories are included), and \<output_file\> is the name of the output file.

The output is a .csv file that contains the Cognitive Complexity or Cognitive Complexity for Decompilation value for each method. Specifically, it contains:

- Absolute Module Path: The path of the class containing the method
- Module Position: The line in the .java file where the method starts
- Module declaration: The method signature and return type or pattern type (longLine)
- Max Nesting: The maximum level of nesting reached by the method (considering as 1 the starting level)
- Cognitive Complexity or Cognitive Complexity for Decompilation
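
One possible way to post-process this output is to sum the per-method values for each path, as sketched below. This assumes the .csv has a header row whose column names match the field names listed above and uses commas as delimiters; neither is confirmed here, so adjust `path_column` and `value_column` to the actual header. The function name is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable


def total_complexity_per_path(
    rows: Iterable[dict],
    path_column: str = "Absolute Module Path",   # ASSUMPTION: header name
    value_column: str = "Cognitive Complexity",  # ASSUMPTION: header name
) -> Dict[str, float]:
    """Sum the per-method complexity values that share the same module path."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[path_column]] += float(row[value_column])
    return dict(totals)


def total_complexity_from_file(csv_path: str, **columns: str) -> Dict[str, float]:
    """Read the calculator's output file and aggregate it per module path."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return total_complexity_per_path(csv.DictReader(handle), **columns)
```

For output produced by the Cognitive Complexity for Decompilation calculator, pass the matching column name through `value_column`.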

## Reference

1. Cognitive Complexity Calculator: https://github.com/BruhZul/cognitive-complexity-calculator
2. MetricsReloaded: https://github.com/BasLeijdekkers/MetricsReloaded/
3. DepDigger: A Tool for Detecting Complex Low-Level Dependencies: https://www.sosy-lab.org/~dbeyer/DepDigger/

Dataset Files

understandability_in_Java_decompilation.zip (888.11 MB)
