Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools

Citation Author(s): Sandro Mendonça (Universidade Federal do Pará)

Yvan Brito (Universidade Federal do Pará)

Carlos Gustavo Resque dos Santos (Universidade Federal do Pará)

Bianchi Serique Meiguins (Universidade Federal do Pará)
Submitted by: Carlos Santos
Last updated: Fri, 03/13/2020 - 21:19
DOI: 10.21227/5aeq-rr34
Data Format: *.csv

Categories:

Standards Research Data

Keywords:

Synthetic Dataset Generator

Benchmark Datasets Creation

Data Creation System

Abstract

These are the datasets used in the article entitled 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. They can be used to test how machine learning and data processing algorithms respond to specific data characteristics, such as outliers, missing values, and class imbalance.

Instructions:

By default, each dataset has two dimensions: one for the class and one for the feature. The variations are specified on top of a default dataset, which has the following characteristics:

1,000 entries
No outliers
No missing values
Two dimensions (one relevant feature and one class; no bad features)
80% class separation
Two classes
No class imbalance
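
One way to check a downloaded file against these default characteristics is sketched below. It is only an illustration under stated assumptions: the column name `class` and the convention that a missing value appears as an empty cell are hypothetical, since the record does not describe the CSV header or the missing-value encoding. The function name `summarize` is likewise introduced here for illustration.

```python
import csv
import io
from collections import Counter


def summarize(csv_file, class_column="class"):
    """Return entry count, fraction of empty cells, and class proportions.

    Assumptions (not confirmed by the dataset record): the file has a header
    row, the class column is named `class`, and missing values are empty cells.
    """
    rows = list(csv.DictReader(csv_file))
    n_cells = sum(len(row) for row in rows)
    n_missing = sum(
        1
        for row in rows
        for value in row.values()
        if value is None or str(value).strip() == ""
    )
    counts = Counter(row[class_column] for row in rows)
    return {
        "entries": len(rows),
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
        "class_share": {label: n / len(rows) for label, n in counts.items()},
    }


# Tiny in-memory example standing in for one of the CSV files.
sample = io.StringIO("feature,class\n0.1,A\n,B\n0.7,A\n0.9,B\n")
summary = summarize(sample)
print(summary)
```

For a file matching the default dataset, one would expect 1,000 entries, a missing fraction of zero, and two classes with equal shares. Outliers and class separation are not checked here because the record does not define how they are measured.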

Thus, six types of datasets were generated, one for each of six characteristics of the default dataset. For each type, the system generated four datasets with slight differences in the associated characteristic. For instance, to vary the amount of outliers, the system created datasets with 10%, 20%, 30%, and 40% outliers, without changing the other characteristics. The variations of the characteristics are the following:

Amount of outliers: [10%, 20%, 30%, 40%, 50%]
Class separation: [100%, 90%, 80%, 70%, 60%]
Amount of missing values: [10%, 20%, 30%, 40%, 50%]
Class imbalance: [50%-50%, 40%-60%, 30%-70%, 20%-80%, 10%-90%]
Bad features: [1-1, 1-3, 1-5, 1-7, 1-9]
Amount of classes: [2, 12, 22, 32, 42]
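
The "one characteristic at a time" scheme described above can be sketched as a small configuration grid. This is a minimal illustration, not the generator's actual interface: the dictionary keys and the function `variation_configs` are hypothetical names, percentages are written as fractions, and the bad-feature values are kept as the literal strings from the list because the record does not explain the "1-1" notation.

```python
# Default characteristics, as listed for the default dataset.
DEFAULT = {
    "entries": 1000,
    "outliers": 0.0,
    "missing_values": 0.0,
    "bad_features": None,          # "no bad features"
    "class_separation": 0.80,
    "classes": 2,
    "class_imbalance": (0.5, 0.5),
}

# Listed values for each varied characteristic.
VARIATIONS = {
    "outliers": [0.10, 0.20, 0.30, 0.40, 0.50],
    "class_separation": [1.00, 0.90, 0.80, 0.70, 0.60],
    "missing_values": [0.10, 0.20, 0.30, 0.40, 0.50],
    "class_imbalance": [(0.5, 0.5), (0.4, 0.6), (0.3, 0.7), (0.2, 0.8), (0.1, 0.9)],
    "bad_features": ["1-1", "1-3", "1-5", "1-7", "1-9"],  # notation kept verbatim
    "classes": [2, 12, 22, 32, 42],
}


def variation_configs(default, variations):
    """Build one configuration per listed value, changing a single
    characteristic and leaving every other default untouched."""
    configs = []
    for name, values in variations.items():
        for value in values:
            config = dict(default)   # copy so the default stays unchanged
            config[name] = value
            configs.append((name, config))
    return configs


configs = variation_configs(DEFAULT, VARIATIONS)
same_as_default = sum(1 for _, config in configs if config == DEFAULT)
print(len(configs), same_as_default)
```

Enumerating every listed value gives 30 configurations, three of which are identical to the default (80% class separation, 50%-50% balance, and two classes), because those lists include the default value itself.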