Open Access
Dataset of article: Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools
- Submitted by: Carlos Santos
- Last updated: Fri, 03/13/2020 - 17:19
- DOI: 10.21227/5aeq-rr34
Abstract
Datasets used in the article 'Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools'. They can be used to test how machine learning and data processing algorithms behave under several data characteristics.
Instructions:
The default dataset has two dimensions: one for the class and one for the feature. The variations are specified on top of this default dataset, which has the following characteristics:
- 1,000 entries
- No outliers
- No missing values
- Two dimensions (one relevant feature and one class, no bad features)
- 80% class separation
- Two classes
- No class imbalance
Thus, six types of datasets were generated, one for each of the six varied characteristics of the default dataset. For each type, the system generated four datasets with slight differences in the associated characteristic. For instance, to vary the proportion of outliers, the system created datasets with 10%, 20%, 30%, and 40% outliers, without changing the other characteristics. The variations of the characteristics are the following:
- Proportion of outliers: [10%, 20%, 30%, 40%, 50%]
- Class separation: [100%, 90%, 80%, 70%, 60%]
- Proportion of missing values: [10%, 20%, 30%, 40%, 50%]
- Class imbalance: [50%-50%, 40%-60%, 30%-70%, 20%-80%, 10%-90%]
- Bad features (good features-bad features): [1-1, 1-3, 1-5, 1-7, 1-9]
- Number of classes: [2, 12, 22, 32, 42]
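The one-characteristic-at-a-time scheme described above can be sketched in a few lines of Python. This is only an illustration, not the generator from the article. The dictionary keys and the function name one_factor_configs are hypothetical. The levels are copied from the list above. Bad features are stored as the number of bad features added next to the single good feature, which is an interpretation of the "1-3" notation.

```python
# Default characteristics as listed on this page (key names are hypothetical).
DEFAULT = {
    "entries": 1000,
    "outliers_pct": 0,
    "missing_pct": 0,
    "bad_features": 0,
    "class_separation_pct": 80,
    "classes": 2,
    "class_balance": (50, 50),
}

# Levels listed for each varied characteristic.
VARIATIONS = {
    "outliers_pct": [10, 20, 30, 40, 50],
    "class_separation_pct": [100, 90, 80, 70, 60],
    "missing_pct": [10, 20, 30, 40, 50],
    "class_balance": [(50, 50), (40, 60), (30, 70), (20, 80), (10, 90)],
    "bad_features": [1, 3, 5, 7, 9],
    "classes": [2, 12, 22, 32, 42],
}


def one_factor_configs(default, variations):
    """Return (characteristic, config) pairs, each overriding one default value."""
    configs = []
    for name, levels in variations.items():
        for level in levels:
            config = dict(default)   # copy, so the default stays untouched
            config[name] = level     # change exactly one characteristic
            configs.append((name, config))
    return configs


configs = one_factor_configs(DEFAULT, VARIATIONS)
```

Each resulting configuration differs from the default in at most one key. Some listed levels, such as 80% class separation or two classes, coincide with the default itself.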
Dataset Files
- Dataset with 10% of missing values: missing10porcento1000.csv (22.53 kB)
- Dataset with 20% of missing values: missing20porcento1000.csv (22.75 kB)
- Dataset with 30% of missing values: missing30porcento1000.csv (22.85 kB)
- Dataset with 40% of missing values: missing40porcento1000.csv (22.46 kB)
- Dataset with balanced classes: desbalanceamento5050-1000.csv (23.48 kB)
- Dataset with imbalance of 60% for one class: desbalanceamento6040-1000.csv (23.54 kB)
- Dataset with imbalance of 70% for one class: desbalanceamento7030-1000.csv (23.57 kB)
- Dataset with imbalance of 80% for one class: desbalanceamento8020-1000.csv (23.62 kB)
- Dataset with 1 good feature and 1 bad feature: 1-1features1000.csv (43.32 kB)
- Dataset with 1 good feature and 3 bad features: 1-3features1000.csv (80.89 kB)
- Dataset with 1 good feature and 7 bad features: 1-7features1000.csv (118.58 kB)
- Dataset with 1 good feature and 9 bad features: 1-9features1000.csv (156.27 kB)
- Dataset with 2 classes: 2classes1000.csv (23.49 kB)
- Dataset with 12 classes: 12classes1000.csv (22.98 kB)
- Dataset with 22 classes: 22classes1000.csv (23.50 kB)
- Dataset with 32 classes: 32classes1000.csv (23.70 kB)
- Dataset with 10% of outliers: outlier10porcento1000.csv (23.51 kB)
- Dataset with 20% of outliers: outlier20porcento1000.csv (23.47 kB)
- Dataset with 30% of outliers: outlier30porcento1000.csv (23.44 kB)
- Dataset with 40% of outliers: outlier40porcento1000.csv (23.48 kB)
- Dataset with 10% of overlap between classes: separacao90porcento1000.csv (23.50 kB)
- Dataset with 20% of overlap between classes: separacao80porcento1000.csv (23.47 kB)
- Dataset with 30% of overlap between classes: separacao70porcento1000.csv (23.48 kB)
- Dataset with 40% of overlap between classes: separacao60porcento1000.csv (23.50 kB)
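The file names listed above encode the varied characteristic and its level, using Portuguese stems: "porcento" (percent), "desbalanceamento" (imbalance), and "separacao" (separation). A minimal sketch of a parser for these names follows. The patterns are inferred only from the names on this page. The function name parse_filename and the returned keys are hypothetical. Reading the trailing 1000 as the number of entries is an assumption.

```python
import re

# Patterns inferred from the listed file names (not an official naming spec).
PATTERNS = [
    ("missing_pct", re.compile(r"^missing(\d+)porcento(\d+)\.csv$")),
    ("outliers_pct", re.compile(r"^outlier(\d+)porcento(\d+)\.csv$")),
    ("class_separation_pct", re.compile(r"^separacao(\d+)porcento(\d+)\.csv$")),
    ("classes", re.compile(r"^(\d+)classes(\d+)\.csv$")),
    # Two two-digit shares, e.g. "7030" means a 70%-30% split.
    ("class_balance", re.compile(r"^desbalanceamento(\d{2})(\d{2})-(\d+)\.csv$")),
    # "<good>-<bad>features", e.g. "1-3" means 1 good and 3 bad features.
    ("bad_features", re.compile(r"^(\d+)-(\d+)features(\d+)\.csv$")),
]


def parse_filename(filename):
    """Return the characteristic, level, and assumed entry count, or None."""
    for characteristic, pattern in PATTERNS:
        match = pattern.match(filename)
        if match is None:
            continue
        # The last numeric group is assumed to be the number of entries.
        *level_parts, entries = (int(group) for group in match.groups())
        level = level_parts[0] if len(level_parts) == 1 else tuple(level_parts)
        return {"characteristic": characteristic, "level": level, "entries": entries}
    return None


info = parse_filename("desbalanceamento7030-1000.csv")
```

Here info["level"] is the pair (70, 30). A name that matches none of the patterns yields None, so unexpected files can be skipped rather than misread.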