Duplicated code in large code corpora adversely affects both the evaluation and the use of machine learning models that rely on them. Most existing corpora suffer from this problem to some extent. This dataset contains a "duplication" index for some of the existing corpora used in Big Code research. The method for collecting this dataset is described in "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Allamanis [arXiv, to appear in SPLASH 2019].

 

[1] Miltiadis Allamanis, "Deduplication Index for Big Code Datasets", IEEE Dataport, 2019. [Online]. Available: http://dx.doi.org/10.21227/e9eb-ma51. Accessed: Feb. 06, 2025.