Abstract

Overview

The dataset is a compilation of code snippets, function descriptions, and their corresponding binary representations, intended to support research in software engineering. It covers a variety of code functionality and can be used to study the behavior and characteristics of C programs. The data is sourced from the AnghaBench repository, a collection of C programs available on GitHub.

Columns and Data Types

The dataset contains the following columns:

Name: The identifier for each code snippet, such as a filename or function name.

Comment: A brief description of what the code snippet is meant to do.

Source: The source code snippet itself, which can be in C or another language.

Binary: The corresponding LLVM Intermediate Representation (IR) or other binary form of the code snippet.
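
The four columns above can be modeled as a small typed record. The following is a minimal sketch, not part of the dataset's own tooling: the names Snippet and snippet_from_row are hypothetical, and the exact spelling of the CSV header is an assumption, which is why headers are matched case-insensitively.

```python
from typing import Mapping, NamedTuple


class Snippet(NamedTuple):
    """One dataset row; field names mirror the four documented columns."""
    name: str
    comment: str
    source: str
    binary: str


def snippet_from_row(row: Mapping[str, str]) -> Snippet:
    """Build a Snippet from a dict-like CSV row.

    Header matching is case-insensitive because the exact header spelling
    in the CSV file is an assumption. Missing or empty cells become "".
    """
    lowered = {
        key.strip().lower(): value
        for key, value in row.items()
        if isinstance(key, str)
    }
    return Snippet(
        name=lowered.get("name") or "",
        comment=lowered.get("comment") or "",
        source=lowered.get("source") or "",
        binary=lowered.get("binary") or "",
    )
```

A row produced by csv.DictReader can be passed directly to snippet_from_row.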

Use Case and Relevance

The dataset is intended for researchers and practitioners in software engineering, particularly those working on code analysis, benchmarking, and optimization.

Data Source

The raw data was drawn from the AnghaBench repository, a collection of C programs designed to support software engineering tasks such as benchmarking and code analysis.

Submission

This dataset is prepared for submission to the IEEE Transactions on Software Engineering (IEEE-TSE) journal.

Instructions:

Instructions for Using the Dataset File

Prerequisites

Ensure that you have enough disk space to accommodate the dataset file.

Make sure you have Python installed if you intend to perform any programmatic data manipulations.
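
The disk-space prerequisite can be checked from Python before downloading. This is a minimal sketch: the helper name has_room_for is hypothetical, and reading the listed 2.69 GB file size as decimal gigabytes is an assumption.

```python
import shutil

# Listed size of contrabin_pretrain.csv, read as decimal gigabytes (assumption).
DATASET_BYTES = int(2.69 * 10**9)


def has_room_for(target_dir: str, needed_bytes: int = DATASET_BYTES) -> bool:
    """Return True when the filesystem holding target_dir has enough free space."""
    return shutil.disk_usage(target_dir).free >= needed_bytes
```

Calling has_room_for(".") checks the current working directory; preprocessing that writes a filtered copy of the file would need additional space on top of this.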

Steps

Download Dataset:

Download the dataset file from the provided source. If the file is hosted on a platform like GitHub, you can use the Download button or clone the repository.

Inspect the File:

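Open the dataset file in a text editor for a quick look or, preferably, use a data manipulation library such as Pandas in Python to inspect the first rows. Because the file is 2.69 GB, reading only the first rows is usually more practical than opening the whole file.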
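
A first look at the data does not require loading the whole file. The sketch below uses only the standard library csv module and reads just the requested number of rows; the function names preview_rows and summarize are hypothetical, and the header names depend on the actual CSV file.

```python
import csv
import itertools

# Source and IR fields can be long; the csv module rejects fields over
# 131072 characters unless the limit is raised.
csv.field_size_limit(2**31 - 1)


def preview_rows(lines, n_rows=5):
    """Return the first n_rows records as dicts keyed by the header row.

    `lines` is any iterable of text lines, such as an open file handle,
    so only the rows actually requested are read.
    """
    reader = csv.DictReader(lines)
    return list(itertools.islice(reader, n_rows))


def summarize(rows):
    """Map each column name to the length of its longest value in rows."""
    widths = {}
    for row in rows:
        for column, value in row.items():
            widths[column] = max(widths.get(column, 0), len(value or ""))
    return widths
```

With the real file, the handle would come from open("contrabin_pretrain.csv", newline="", encoding="utf-8"), where the UTF-8 encoding is an assumption. In Pandas, read_csv with the nrows parameter gives a comparable partial read.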

Data Preprocessing:

Depending on your specific requirements, you may need to clean the data or filter out unnecessary rows or columns.
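
As one possible cleaning step, rows with an empty source or binary field can be dropped while streaming, so the full file never has to sit in memory. This is a sketch under assumptions: the column names Source and Binary are taken from the column list in this description and may be spelled differently in the file, and the function name complete_rows is hypothetical.

```python
import csv

# Allow long source and IR fields (the default limit is 131072 characters).
csv.field_size_limit(2**31 - 1)


def complete_rows(lines, required=("Source", "Binary")):
    """Yield only the rows whose required columns are all non-empty.

    `lines` is any iterable of text lines, such as an open file handle.
    Cells that are missing or contain only whitespace count as empty.
    """
    for row in csv.DictReader(lines):
        if all((row.get(column) or "").strip() for column in required):
            yield row
```

The yielded rows can be written to a smaller file with csv.DictWriter, or consumed directly by later analysis code.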

Analysis or Modelling:

Once you understand the dataset, proceed to analysis or modelling, using statistical software or a programming language such as Python.

Validation:

After performing your analyses or running your models, make sure to validate the results through appropriate techniques, such as cross-validation for machine learning models.
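
For the cross-validation mentioned above, the core idea is to split row indices into k folds and hold each fold out once. The following is a dependency-free sketch of plain k-fold splitting; the function name k_fold_indices and its defaults are hypothetical, and it is not a method prescribed by the dataset authors.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, then dealt into k folds,
    so every item appears in exactly one test fold.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test
```

Libraries such as scikit-learn provide an equivalent KFold splitter. With code data, a random split may place very similar snippets in both train and test folds, so the resulting scores should be read with that in mind.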

Documentation:

Keep track of the changes you make during preprocessing and analysis. This is crucial for reproducibility and for anyone else who might use this dataset in the future.
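
One lightweight way to document changes is to store a checksum of the input file together with the list of steps applied to it. The sketch below is one possible approach, not a required format; the function names and the JSON layout are hypothetical.

```python
import hashlib
import json


def file_chunks(path, size=1 << 20):
    """Yield a file's contents in fixed-size binary chunks."""
    with open(path, "rb") as handle:
        while chunk := handle.read(size):
            yield chunk


def sha256_of(chunks):
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def provenance_record(input_digest, steps):
    """Serialize the input checksum and the ordered processing steps as JSON."""
    return json.dumps({"input_sha256": input_digest, "steps": list(steps)}, indent=2)
```

For example, sha256_of(file_chunks("contrabin_pretrain.csv")) identifies the exact input file, and the resulting record can be saved next to any derived data.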

By following these steps, you should be able to use the dataset effectively for your research or project needs.

Funding Agency:

DARPA

Grant Number:

N6600120C4026

Dataset Files

contrabin_pretrain.csv (2.69 GB)
contrabin_script.py (1.94 kB)


Pre-Training Representations of Binary Code Using Contrastive Learning
