Deduplication Index for Big Code Datasets

Citation Author(s): Miltiadis Allamanis (Microsoft Research)
Submitted by: Miltiadis Allamanis
Last updated: Thu, 06/27/2019 - 11:47
DOI: 10.21227/e9eb-ma51
License: Creative Commons Attribution

Code duplicates in large code corpora adversely affect both the evaluation and the use of machine learning models trained on them. Most existing corpora suffer from this problem to some extent. This dataset contains a "duplication" index for some of the existing corpora used in Big Code research. The method for collecting this dataset is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].

Instructions: 

For each of the existing datasets, a single .json file is provided. Each JSON file has the following format:

[duplicate_group_1, duplicate_group_2, ...]

where each duplicate group is a list of filenames from that dataset that are near-duplicates of one another.

For corpora that were distributed as a single file (e.g. Hashimoto et al.), the line number of the original record is given instead of a filename.
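As a minimal sketch of how such an index might be consumed, the Python below reads a list of duplicate groups and filters a corpus so that one entry per group remains. The function names, the sample filenames, and the policy of keeping the first member of each group are assumptions made for illustration; they are not taken from the dataset or the paper. The sketch also assumes that groups do not overlap.

```python
import json


def load_duplicate_groups(path):
    """Read an index file: a JSON list of duplicate groups (lists of entries)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def entries_to_drop(groups):
    """Return every entry except the first of each group.

    Keeping the first member is an arbitrary choice made for this sketch.
    Entries may be filenames (str) or record line numbers (int).
    """
    drop = set()
    for group in groups:
        drop.update(group[1:])
    return drop


def deduplicate(entries, groups):
    """Filter `entries`, preserving order, so one entry per group survives."""
    drop = entries_to_drop(groups)
    return [entry for entry in entries if entry not in drop]


# Hypothetical in-memory index with the same shape as the .json files.
example_groups = json.loads('[["a.java", "b.java", "c.java"], ["x.java", "y.java"]]')
example_corpus = ["a.java", "b.java", "x.java", "y.java", "z.java"]
print(deduplicate(example_corpus, example_groups))
# prints ['a.java', 'x.java', 'z.java']
```

Entries that appear in no group, such as z.java here, are left untouched. For a real index, load_duplicate_groups would be called with the path to the downloaded .json file for the corpus in question.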

M. Allamanis, "Deduplication Index for Big Code Datasets", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/e9eb-ma51. Accessed: Oct. 16, 2019.