Open Access

Deduplication Index for Big Code Datasets

Citation Author(s): Miltiadis Allamanis (Microsoft Research)
Submitted by: Miltiadis Allamanis
Last updated: Thu, 06/27/2019 - 15:47
DOI: 10.21227/e9eb-ma51
Data Format: Zip file containing .json
Links: GitHub

Keywords: big code; duplication

Abstract

Duplicated code in large code corpora adversely affects the evaluation and use of machine learning models that rely on those corpora. Most existing corpora suffer from this problem to some extent. This dataset contains a duplication index for some of the existing corpora used in Big Code research. The method used to collect it is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].

Instructions:

For each of the existing datasets, a single .json file is provided. Each JSON file has the following format:

[ duplicate_group_1, duplicate_group_2, ...]

where each duplicate group is a list of filenames from that dataset that are near-duplicates of one another.

For corpora that were distributed as a single file (e.g. Hashimoto et al.), the line number of the original record is given in place of a filename.
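
As an illustration of how the index might be consumed, the following is a minimal Python sketch that loads one index file and computes the entries to exclude so that one representative per duplicate group remains. The function names and the keep-the-first-entry policy are assumptions made for this example; they are not part of the dataset or of the method described in the paper.

```python
import json
from typing import Hashable, List, Set

# A duplicate group is a list of entries: filenames, or line numbers
# for corpora that were distributed as a single file.
DuplicateGroup = List[Hashable]


def load_duplicate_groups(path: str) -> List[DuplicateGroup]:
    """Read one index file: a JSON list of duplicate groups."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def entries_to_drop(groups: List[DuplicateGroup]) -> Set[Hashable]:
    """Return every entry except the first of each group.

    Keeping the first entry is an arbitrary policy chosen for this
    sketch; any single representative per group would serve.
    """
    drop: Set[Hashable] = set()
    for group in groups:
        drop.update(group[1:])
    return drop
```

For example, calling load_duplicate_groups("example_corpus.json") (a hypothetical file name; use the actual .json file for the corpus in question) and passing the result to entries_to_drop yields a set that can be used to filter the corpus before splitting it into training and test sets.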

Dataset Files

duplication_index.zip (Size: 11.8 MB)
