Binary classifiers' outputs for ensemble creation
- Submitted by: Attila Tiba
- Last updated: Fri, 05/31/2019 - 04:47
- DOI: 10.21227/7pf8-nq83
Abstract
This dataset was created based on the paper 'Andras Hajdu, Gyorgy Terdik, Attila Tiba, and Henrietta Toman: A stochastic approach to handle knapsack problems in the creation of ensembles'. Our experimental setup for UCI binary classification problems is as follows. The base classifiers are perceptron, decision tree, Levenberg-Marquardt feedforward neural network, random neural network, and discriminative restricted Boltzmann machine. They were applied to 5 UCI datasets: MAGIC Gamma Telescope, HIGGS, EEG EyeState, Musk (Version 2), and Spambase. Datasets of large cardinality were selected so that synthetic variants of the base classifiers could be trained on different subsets. To check our models for different numbers of possible ensemble members, the pool sizes were set to 30 and 100; the required number of classifiers was reached by training the base classifiers on different subsets of the training part of each dataset.
The folder data_30 contains 5 .csv files corresponding to the 5 UCI datasets.
Each .csv file contains the classification results of the 30 classifiers as follows:
- each row corresponds to a line from the corresponding UCI dataset,
- each column in the range 1-30 represents the class label predicted by a given classifier,
- column 31 represents the ground truth label of the given case,
- column 32 represents the line index of the given case from the corresponding UCI dataset,
- the first 30% of the rows contain the results on the test set, the last 70% on the training set.
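The column layout and the test/training split described above can be read with a short script. This is a minimal sketch using only the Python standard library, not code supplied with the dataset. The helper names `load_predictions` and `split_test_train` are hypothetical, and several details are assumptions: that labels and line indices are numeric, that the files have no header row (switch `has_header` if they do), and that the 30% boundary rounds down.

```python
import csv

def load_predictions(lines, n_classifiers=30, has_header=False):
    """Split each CSV row into classifier votes, ground truth, and line index."""
    reader = csv.reader(lines)
    if has_header:  # assumption: the description does not say whether a header exists
        next(reader, None)
    votes, truth, line_index = [], [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # columns 1..n_classifiers: predicted labels (assumed numeric)
        votes.append([int(float(v)) for v in row[:n_classifiers]])
        # column n_classifiers + 1: ground truth label
        truth.append(int(float(row[n_classifiers])))
        # column n_classifiers + 2: line index in the original UCI dataset
        line_index.append(int(float(row[n_classifiers + 1])))
    return votes, truth, line_index

def split_test_train(rows, test_percent=30):
    """Return (test, train): the first test_percent% of rows, then the rest."""
    n_test = len(rows) * test_percent // 100  # assumption: boundary rounds down
    return rows[:n_test], rows[n_test:]

# Usage with a path of your choosing (file names are not listed in the description):
# with open(path_to_csv, newline="") as f:
#     votes, truth, line_index = load_predictions(f)
# test_votes, train_votes = split_test_train(votes)
# test_truth, train_truth = split_test_train(truth)
```

For the files in data_100, the same helpers apply with `n_classifiers=100`.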
The folder data_100 contains 5 .csv files corresponding to the 5 UCI datasets.
Each .csv file contains the classification results of the 100 classifiers as follows:
- each row corresponds to a line from the corresponding UCI dataset,
- each column in the range 1-100 represents the class label predicted by a given classifier,
- column 101 represents the ground truth label of the given case,
- column 102 represents the line index of the given case from the corresponding UCI dataset,
- the first 30% of the rows contain the results on the test set, the last 70% on the training set.
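Once the predicted labels and the ground truth are available as lists, a simple baseline is to score each classifier and a majority-vote ensemble over a chosen subset of the pool. The sketch below illustrates that baseline only; it is not the stochastic knapsack-based selection method of the paper. The function names and the tiny `votes` and `truth` values are made up for illustration, and ties (possible with an even number of members) are resolved arbitrarily.

```python
from collections import Counter

def accuracy(predicted, truth):
    """Fraction of cases where the predicted label equals the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

def majority_vote(vote_rows, members):
    """Per row, the label predicted by most of the selected classifiers.

    `members` holds 0-based column positions of the ensemble members.
    Ties are broken arbitrarily (an illustrative choice, not the paper's rule).
    """
    return [
        Counter(row[m] for m in members).most_common(1)[0][0]
        for row in vote_rows
    ]

# Made-up example: 4 cases, 3 classifiers (not taken from the dataset).
votes = [[1, 1, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
truth = [1, 0, 1, 1]

# Accuracy of each individual classifier (one column at a time).
individual = [accuracy([row[j] for row in votes], truth) for j in range(3)]

# Accuracy of the ensemble formed by all three classifiers.
ensemble = accuracy(majority_vote(votes, members=[0, 1, 2]), truth)
```

On the actual files, `votes` would hold columns 1-30 or 1-100 of the test rows, and `members` would be the subset of classifiers selected for the ensemble.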