The effectiveness of hidden dependence metrics in bug prediction - online appendix

Citation Author(s):
Judit Jász
Submitted by:
Judit Jasz
Last updated:
Wed, 03/06/2024 - 07:55
DOI:
10.21227/t74t-vj82

Abstract 

This dataset contains the online appendix of the paper titled "The effectiveness of hidden dependence metrics in bug prediction".

Abstract of the paper:

Finding and fixing bugs in programs is perhaps one of the most difficult, yet most important, tasks in software maintenance. For this reason, a great deal of work has been done on the topic in recent decades, most of it based on machine learning methods. Studies on bug prediction exist for almost all programming languages. The proposed solutions generally try to predict bugs from information that can be easily extracted from the source code, rather than relying on more expensive techniques that require a deeper understanding of the program. These solutions also usually predict faults at a coarse level (module, file, or class), which is useful but still leaves locating the bug itself a difficult task.

In this work, we present a solution that predicts bugs at the method level while also tracking the dependencies in the program using an efficient algorithm, resulting in more accurate bug prediction. Our measurements show that the approach outperforms predictions based on traditional metrics in most cases, and with proper filtering it achieves an 11% improvement in F-measure for the best-performing algorithm, RandomForest. Finally, we also show that the introduced metrics are suitable for predicting bugs that appear later in a given project, provided that sufficient training data is available.

Instructions: 

This online appendix contains the data accompanying the paper "The effectiveness of hidden dependence metrics in bug prediction", organised as follows:

* additional_data: 

      This directory contains a single file: methods_lineinfo.csv. 

      This file supplements the BugHunter dataset referenced by the paper by providing the location of each method at the file level, in addition to its name. This may make the methods easier to identify for those who do not want to use a separate parser to determine the method names.

* arffs:

      This directory contains the ARFF files that we used for our measurements with the Weka software.

      * method-subtract_intersection.arff: the original method-level data of the BugHunter dataset with the subtract filtering applied, excluding the methods that could not be identified or analysed.

      * sea-subtract_intersection.arff: SEA metrics for methods in the method-subtract dataset, for RQ1 comparisons.

      * sea-all.arff: SEA metrics for all the methods we analysed; this is the dataset used for RQ2 and RQ3.

* initial data (raw data):

      These files are the unprocessed outputs of the BugHunter dataset and of our own metrics calculations. The ARFF files used in our studies were derived from them.

* results: 

      A summary of the results obtained, including the detailed numerical results that could not be listed in the paper due to lack of space.
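
For readers who want to inspect the ARFF files outside Weka, the following is a minimal sketch of a reader written with the Python standard library only. It is an illustration under stated assumptions, not part of the original tooling: the function name `read_arff` and the attribute names in the sample are hypothetical, and the parser assumes dense (non-sparse) data rows and attribute names without spaces. For anything beyond a quick look, an established ARFF reader (Weka itself, or `scipy.io.arff.loadarff`) is likely the safer choice.

```python
import csv
import io


def read_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Simplified sketch: reads @attribute / @data headers and plain
    comma-separated data rows. Sparse rows ({...}) and attribute
    names containing spaces are NOT supported.
    """
    names, rows, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # The second whitespace-separated token is the attribute name.
                names.append(line.split(None, 2)[1].strip("'\""))
            elif lower.startswith("@data"):
                in_data = True
            continue
        # csv.reader copes with single-quoted values that contain commas.
        rows.append(next(csv.reader([line], quotechar="'")))
    return names, rows


# Hypothetical miniature file; the attribute names are NOT taken from the dataset.
sample = io.StringIO(
    "% demo only\n"
    "@relation demo\n"
    "@attribute metric_a numeric\n"
    "@attribute metric_b numeric\n"
    "@attribute bug {0,1}\n"
    "@data\n"
    "3,12.5,0\n"
    "7,40.0,1\n"
)
names, rows = read_arff(sample)
print(names)
print(rows)

# With the appendix unpacked, the same call could be applied to a real file,
# e.g. (path assumed relative to the appendix root):
#   with open("arffs/sea-all.arff") as f:
#       names, rows = read_arff(f)
```

All values are returned as strings, so numeric columns need an explicit conversion before any analysis.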