Datasets
Standard Dataset
Decompiled code dataset

- Submitted by: Zhou Zhiping
- Last updated: Tue, 01/28/2025 - 10:14
- DOI: 10.21227/mwdw-8x69
Abstract
The dataset used in this study is drawn from the benchmark datasets~\cite{marcelli2022machine} for binary similarity detection and was decompiled with \textit{IDA Pro 7.5}. We selected six of these datasets for evaluation: \textit{Coreutils-ARM-32}, \textit{Curl-MIPS-32}, \textit{ImageMagick-ARM-32}, \textit{OpenSSL-X86-32}, \textit{Putty-X86-32}, and \textit{SQLite-X86-32}. These datasets are commonly used, but they cover a relatively narrow range of application scenarios. To validate our approach in more diverse, real-world scenarios, we added the popular GitHub algorithm library \textit{CAlgorithm-X86-64}~\cite{TheAlgorithms_C}. With 43.4k followers, this library has a considerably larger and more diverse user base than the other six datasets, which makes our detection results more representative and generalizable. By including this widely recognized library, we aim to demonstrate that our method handles a broader range of practical applications while remaining robust and adaptable.
The number of function pairs selected from each of the seven projects varies with library size and complexity. Based on source code size, we randomly selected 55, 110, 60, 150, 55, 90, and 100 pairs of decompiled functions and their corresponding source code, for a total of 620 pairs. To keep the decompiled code aligned with the source code, we chose optimization level O0. The distortion types, labeled \textit{I1} to \textit{I6}, were annotated manually; the annotated dataset comprises over 40,000 lines of code.
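The pair counts above can be checked with a few lines of Python. This is only a sketch: the description does not state explicitly which count belongs to which project, so assigning them in listing order is an assumption, and the names `PROJECTS`, `PAIR_COUNTS`, `pairs_per_project`, and `total_pairs` are hypothetical.

```python
# Project names and pair counts as given in the dataset description.
PROJECTS = [
    "Coreutils-ARM-32",
    "Curl-MIPS-32",
    "ImageMagick-ARM-32",
    "OpenSSL-X86-32",
    "Putty-X86-32",
    "SQLite-X86-32",
    "CAlgorithm-X86-64",
]
PAIR_COUNTS = [55, 110, 60, 150, 55, 90, 100]


def pairs_per_project():
    """Pair each project with a count.

    ASSUMPTION: counts are matched to projects in listing order;
    the description does not state this mapping explicitly.
    """
    return dict(zip(PROJECTS, PAIR_COUNTS))


def total_pairs():
    """Sum the per-project pair counts."""
    return sum(PAIR_COUNTS)


if __name__ == "__main__":
    for project, count in pairs_per_project().items():
        print(f"{project}: {count}")
    print("total:", total_pairs())
```

The sum matches the stated total of 620 pairs regardless of how the counts are assigned to projects.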
Each code pair consists of source code and the corresponding decompiled code; the source code is annotated with distortion labels so that distortion detection results can be compared against them.
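One way to represent such a pair in memory is sketched below. The on-disk layout of the dataset is not described here, so the class `CodePair`, its field names, and the choice to attach each label to a source line number are all hypothetical; only the label names I1 to I6 come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Distortion labels named in the description: I1 .. I6.
VALID_LABELS = {f"I{i}" for i in range(1, 7)}


@dataclass
class CodePair:
    """Hypothetical record for one source/decompiled function pair."""

    project: str
    source_code: str
    decompiled_code: str
    # ASSUMPTION: each annotation is (source line number, distortion label).
    annotations: List[Tuple[int, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject any label outside I1..I6.
        for line_no, label in self.annotations:
            if label not in VALID_LABELS:
                raise ValueError(f"unknown label {label!r} at line {line_no}")

    def label_counts(self) -> Dict[str, int]:
        """Count how often each distortion label occurs in this pair."""
        counts = {label: 0 for label in sorted(VALID_LABELS)}
        for _, label in self.annotations:
            counts[label] += 1
        return counts
```

A detector's predicted labels could then be compared against `label_counts()` or against the raw `annotations` list, depending on how fine-grained the evaluation needs to be.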