Abstract

This dataset contains the model training results associated with the paper "Distribution-Driven Augmentation of Real-World Datasets for Improved Cancer Diagnostics With Machine Learning". The paper uses kernel density estimators to curate datasets by balancing classes and filling missing (null) values with synthetically generated data. It also proposes a technique for joining distinct datasets so that a model can be trained on the necessary features from multiple datasets, as a form of transfer learning. The data provided here are the performance results of each model (Naive Bayes, Logistic Regression, Support Vector Machine, Decision Tree, and a Voting Classifier) under 5-fold cross-validation. These models were evaluated using DDA (Distribution-Driven Augmentation), our novel solution, and compared against other frequently used techniques.
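To make the class-balancing idea concrete, the following is a minimal sketch of oversampling minority classes by sampling from a Gaussian kernel density estimate. It is not the paper's DDA implementation: the function names (`normal_reference_bandwidth`, `sample_kde`, `balance_classes`), the per-feature product kernel, the normal-reference bandwidth rule, and the toy data are all assumptions chosen for illustration. Each class is assumed to have at least two rows so that a standard deviation can be computed.

```python
import random
import statistics


def normal_reference_bandwidth(values):
    """Normal-reference bandwidth for a 1-D Gaussian KDE (illustrative choice)."""
    n = len(values)
    return 1.06 * statistics.stdev(values) * n ** (-1 / 5)


def sample_kde(rows, n_samples, rng):
    """Draw samples from a product Gaussian KDE fitted to `rows`.

    Sampling from a Gaussian KDE is equivalent to picking an observed row
    uniformly at random and adding Gaussian noise scaled by the bandwidth.
    """
    columns = list(zip(*rows))
    bandwidths = [normal_reference_bandwidth(col) for col in columns]
    samples = []
    for _ in range(n_samples):
        center = rng.choice(rows)
        samples.append([rng.gauss(x, h) for x, h in zip(center, bandwidths)])
    return samples


def balance_classes(X, y, seed=0):
    """Pad every class up to the majority-class size with KDE samples."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = list(X), list(y)  # original rows are kept unchanged
    for label, rows in by_class.items():
        deficit = target - len(rows)
        if deficit > 0:
            X_out.extend(sample_kde(rows, deficit, rng))
            y_out.extend([label] * deficit)
    return X_out, y_out


# Hypothetical toy data: four rows of class 0, two rows of class 1.
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [5.0, 6.0], [5.3, 5.8]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance_classes(X, y)
print(y_bal.count(0), y_bal.count(1))  # prints: 4 4
```

In this sketch the synthetic rows are appended after the original rows, so the real observations can still be separated from the generated ones by index.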

Instructions:

Balancing_Data.xlsx: All model results from exclusively balancing classes

Null_Filling_Data.xlsx: All model results from exclusively filling null values

Joining_Data.xlsx: All model results from joining two datasets together and training a model

Synthetic_Data.xlsx: All model results from synthetically growing a dataset with near-identical distributions

Cervical_Data.xlsx: All model results from performing class balancing and null-filling on a single dataset for a case study

Model Performance Results For Distribution-Driven Augmentation of Medical Data
