Public datasets for GBSR

Name: Public datasets for GBSR
Creator: JianHua Peng
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Machine Learning

Citation Author(s): Jianhua Peng
Submitted by: JianHua Peng
Last updated: Tue, 06/11/2024 - 13:55
DOI: 10.21227/dj5a-qh26
Data Format: *.avi; *.csv; *.txt
Links: UCI; ELVIRA biomedical database; face image datasets; bankDirectMarketing; smartGridStabilityAugmented

Categories: Machine Learning

Keywords: Attribute reduction; Artificial Intelligence; Dataset; Machine Learning; Feature Selection

Abstract

We use a total of 16 datasets, detailed descriptions of which are provided in Table II. Among them, 11 datasets are from the UCI database, the DLBCL-Harvard dataset is from the ELVIRA biomedical database, Yale and ORL are face image datasets, and lung and ALLAML are biological datasets. The bankDirectMarketing and smartGridStabilityAugmented datasets are from the Kaggle platform. These datasets are commonly used for attribute reduction and come from various application domains, with a wide range of sample sizes and attribute counts. This diversity helps demonstrate the effectiveness and generalization performance of the algorithms. In the experiments, all data are normalized to the [0, 1] range to remove the effect of differing attribute scales.

Instructions:

The collection comprises 16 diverse datasets, including machine learning datasets, image datasets, and face recognition datasets. In our experiments, we apply min-max normalization so that the data values are scaled to a fixed range, typically between 0 and 1. The primary objective of the experiments is to explore and evaluate various attribute reduction algorithms. These algorithms reduce the number of attributes (features) in a dataset while maintaining or improving the performance of the machine learning models trained on it. Attribute reduction can therefore make models more efficient, potentially leading to faster processing and improved accuracy. Python is the programming language used to implement and test these attribute reduction algorithms. Through these experiments, we hope to gain deeper insight into the performance and applicability of different attribute reduction techniques across various types of datasets.
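As an illustration of the min-max normalization step mentioned above, the following is a minimal sketch in plain Python, not the authors' code. It rescales each attribute column with x' = (x - min) / (max - min). The function name `min_max_normalize` and the sample values are hypothetical and are not taken from any of the 16 datasets. Mapping a constant column to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(rows):
    """Scale each column of a list-of-rows table to the [0, 1] range."""
    if not rows:
        return []
    # Transpose the rows so that per-column minima and maxima can be computed.
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]

    normalized = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # Assumption: a constant column (span == 0) is mapped to 0.0.
            new_row.append((value - lo) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical sample: three samples, three attributes (the last one constant).
sample = [
    [1.0, 10.0, 5.0],
    [2.0, 20.0, 5.0],
    [4.0, 40.0, 5.0],
]
scaled = min_max_normalize(sample)
```

After this step every attribute lies in [0, 1], so attributes measured on large scales do not dominate distance-based computations during attribute reduction.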

The datasets are as follows:

NO  Dataset                      Samples  Attributes  Labels
1   glass                        214      10          6
2   Libras Movement              360      90          15
3   UrbanLandCover               168      147         9
4   ionosphere                   351      34          2
5   wdbc                         569      30          2
6   spambase                     4601     57          2
7   gamma                        19020    10          2
8   DLBCL-Harvard                77       7129        2
9   Yale                         165      1024        15
10  lung                         203      3312        5
11  ORL                          400      1024        40
12  ALLAML                       72       7129        2
13  bankDirectMarketing          41188    19          2
14  smartGridStabilityAugmented  60000    13          2
15  Htru2                        17898    8           2
16  sensorless                   58509    48          11