Standard Dataset

TCE-2023-08-1046.R1_DATASETS

Citation Author(s):
Asitha Kottahachchi kankanamge Don (RMIT University)
Submitted by:
Asitha Kottahachchi K D
Last updated:
DOI:
10.21227/5cr5-0204
Data Format:

Abstract

In medical applications, machine learning often has to cope with limited training data. Classical self-supervised deep learning techniques have helped in this domain, but they have yet to achieve the accuracy required for medical use. Recently, quantum algorithms have shown promise in handling complex patterns with small datasets. To address this challenge, this study presents a novel solution that combines self-supervised learning with Variational Quantum Classifiers (VQC) and uses Principal Component Analysis (PCA) for dimensionality reduction. This approach ensures generalization even with a small training dataset while preserving data privacy, a vital consideration in medical applications. PCA reduces the feature dimension so that the VQC can operate with just 2 qubits, which works within current quantum hardware limitations and gains an advantage over classical methods. In this study, four medical datasets (PneumoniaMNIST, BreastMNIST, PathMNIST, ChestMNIST) and two non-medical datasets (Hymenoptera Ant & Bees, and Kaggle Cats and Dogs) were employed. During the self-supervised learning stage, we applied supervised contrastive learning to these datasets, producing a 2048-feature dataset for each. Each 2048-feature dataset then underwent data preprocessing and principal component analysis, yielding a corresponding two-feature dataset. The complete collection comprises six 2048-feature datasets and six two-feature datasets. The final two-feature datasets were used with the variational quantum classifier.
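The reduction from 2048 features to two features described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not specify the preprocessing steps, so per-column standardization is an assumption here, the function name `reduce_to_two_features` is hypothetical, and the input is synthetic random data rather than the real feature sets.

```python
import numpy as np


def reduce_to_two_features(X):
    """Project an (n_samples, n_features) array onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    # Assumed preprocessing: standardize each column (the source does not list its steps).
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    # PCA via SVD of the centered data; rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T


# Synthetic stand-in with the same shape as one 2048-feature set (120 samples assumed).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 2048))
X_2d = reduce_to_two_features(X_demo)
print(X_2d.shape)
```

The two output columns play the role of f1 and f2 in the two-feature datasets; with two features, each sample can be encoded on a 2-qubit circuit, which is the motivation the abstract gives for the reduction.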

Instructions:

Each of the 2048-feature datasets includes data columns ranging from f1 to f2048, accompanied by a 'y' column denoting data labels, which are binary values of 0 or 1. Similarly, the 2-feature datasets consist of columns f1, f2, and a 'y' column representing data labels, with values of 0 or 1. All twelve datasets, comprising six with 2048 features each and six with 2 features each, consist of a total of 120 samples.
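The column layout above can be checked when loading a file. The sketch below assumes the datasets are distributed as CSV files with a header row, which this page does not state; the function name `load_feature_dataset` and the inline sample rows are hypothetical and stand in for a real file.

```python
import csv
import io


def load_feature_dataset(handle, n_features):
    """Read rows with columns f1..f<n_features> and y; return (features, labels)."""
    reader = csv.DictReader(handle)
    feature_names = [f"f{i}" for i in range(1, n_features + 1)]
    header = reader.fieldnames or []
    missing = [name for name in feature_names + ["y"] if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing[:5]}")
    features, labels = [], []
    for row in reader:
        label = int(float(row["y"]))
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {row['y']!r}")
        features.append([float(row[name]) for name in feature_names])
        labels.append(label)
    return features, labels


# Made-up rows in the two-feature layout (f1, f2, y); not taken from the dataset.
sample = io.StringIO("f1,f2,y\n0.12,-1.5,0\n2.0,0.3,1\n")
features, labels = load_feature_dataset(sample, n_features=2)
print(features, labels)
```

For a 2048-feature file, the same function would be called with `n_features=2048` on an opened file handle, for example `open(path, newline="")`, where `path` is whatever location the downloaded file has.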

Funding Agency
Australian Research Council
Grant Number
Discovery Project-DP210102761