Our defect dataset comes from the PROMISE repository. The data covers open-source Java systems: ant, camel, ivy, jedit, log4j, lucene, poi, synapse, velocity and xerces. We selected these datasets because each has at least three consecutive releases (where release i was built before release i+1). This allows us to build defect predictors from past data and then test them on a future release of the same project, which is a more practical scenario.
The original dataset contains a list of bugs, their characteristics and the classes to which they belong. The first step was to remove the entries that belonged to class 0, so that the remaining entries belonged to the defective classes. For untuned methods, release i and release i+1 were combined for training, and release i+2 was used for testing. For tuned methods, release i was used for training, release i+1 for tuning and release i+2 for testing.
For example, release i in antV0, which was used for training, contains 20 defective classes out of 125, and release i+1, which was used for tuning, contains 40 defective classes out of 178.
The dataset was processed with the Pandas library for Python to put it in the form our analysis required.
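The release-based split described above can be sketched in Pandas as follows. This is a minimal illustration rather than our exact scripts: the helper names `make_splits` and `defect_summary`, the column names `name`, `loc` and `bug`, and the toy values are all assumptions made for the example, with a class treated as defective when its `bug` count is greater than zero.

```python
import pandas as pd


def make_splits(releases, tuned):
    """Split three consecutive releases (oldest first) into train/tune/test.

    Hypothetical helper: untuned methods train on releases i and i+1 combined;
    tuned methods train on release i and tune on release i+1. Both test on i+2.
    """
    if len(releases) != 3:
        raise ValueError("expected exactly three consecutive releases")
    first, second, third = releases
    if tuned:
        return {"train": first, "tune": second, "test": third}
    combined = pd.concat([first, second], ignore_index=True)
    return {"train": combined, "tune": None, "test": third}


def defect_summary(release, label_col="bug"):
    """Return (number of defective classes, total classes) for one release.

    Assumes `label_col` holds a bug count, where 0 means non-defective.
    """
    defective = release[release[label_col] > 0]
    return len(defective), len(release)


# Toy stand-ins for three consecutive releases (values are made up).
r1 = pd.DataFrame({"name": ["A", "B", "C", "D"],
                   "loc": [10, 200, 35, 80],
                   "bug": [0, 2, 0, 1]})
r2 = pd.DataFrame({"name": ["A", "B", "C", "D", "E"],
                   "loc": [12, 210, 35, 95, 40],
                   "bug": [1, 0, 0, 3, 0]})
r3 = pd.DataFrame({"name": ["A", "B", "C", "D", "E", "F"],
                   "loc": [12, 215, 36, 99, 41, 60],
                   "bug": [0, 0, 1, 0, 0, 1]})

untuned = make_splits([r1, r2, r3], tuned=False)
tuned = make_splits([r1, r2, r3], tuned=True)

for label, frame in [("release i", r1), ("release i+1", r2), ("release i+2", r3)]:
    n_defective, n_total = defect_summary(frame)
    print(f"{label}: {n_defective} defective classes out of {n_total}")
```

Keeping the releases in chronological order, rather than shuffling rows across them, is what ensures that the test data always comes from a later release than the training and tuning data.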