Gaussian Blobs of Varying numbers of samples, centers and features

Citation Author(s):
Sadiksha Sharma
Submitted by:
Sadiksha Sharma
Last updated:
Fri, 10/09/2020 - 02:08
DOI:
10.21227/gzrx-1t37
Data Format:
License:

Abstract 

The dataset contains Gaussian blobs with varying numbers of samples, centers, and features. The number of samples ranges from 500 to 50,000, the number of centers from 2 to 100, and the number of features from 2 to 2048. These sets of Gaussian blobs can be used to test the scalability and effectiveness of clustering algorithms. Each compressed set contains two kinds of files: files ending with "_X.csv" hold the datapoints, while files ending with "_y.csv" hold the corresponding class data.

The filename of each Gaussian blob inside the compressed sets describes the blob. For example, the file "s50000_c50_f2048_X.csv" contains 50,000 samples with 2048 dimensions (features) and 50 centers, and the file "s50000_c50_f2048_y.csv" holds the associated class data for "s50000_c50_f2048_X.csv". The blob files are organized by number of samples. For example, the compressed file "10,000 datapoints set.zip" contains a collection of Gaussian blobs with 10,000 samples each and varying numbers of centers and features. The documentation section has PDF documents that list the files inside each compressed file.
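
The naming scheme described above can be decoded programmatically. The following is a minimal sketch, assuming every file follows the "s<samples>_c<centers>_f<features>_<X|y>.csv" pattern seen in the example filenames. The names BlobSpec and parse_blob_filename are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
import re
from typing import NamedTuple


class BlobSpec(NamedTuple):
    """Hypothetical container for the values encoded in a blob filename."""
    samples: int
    centers: int
    features: int
    kind: str  # "X" for datapoints, "y" for class data


# Assumed pattern, inferred from names such as "s50000_c50_f2048_X.csv".
_PATTERN = re.compile(r"^s(\d+)_c(\d+)_f(\d+)_(X|y)\.csv$")


def parse_blob_filename(name: str) -> BlobSpec:
    """Split a blob filename into samples, centers, features and file kind."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized blob filename: {name!r}")
    samples, centers, features, kind = match.groups()
    return BlobSpec(int(samples), int(centers), int(features), kind)


spec = parse_blob_filename("s50000_c50_f2048_X.csv")
print(spec.samples, spec.centers, spec.features, spec.kind)
```

A helper like this can be used to pick out, for instance, only the blobs with a given number of features before loading any data.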

Instructions: 

Please go through the documentation files (PDFs) before downloading the compressed zips. The PDFs contain lists of files that are within each compressed file.

The datapoints are real numbers with up to 15 decimal places. Because of this precision, a clustering algorithm may take a long time to converge. If you need to round the numbers, you can do so with the pandas call DataFrameName.round(decimals=decimal_place).
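
The rounding step can be sketched as follows, using pandas. This is only an illustration: the two rows of values are made up, and the assumption that the "_X.csv" files have no header row is not confirmed by the description, so check a file before relying on header=None.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for an "_X.csv" file; the values are made up.
# Assumption: no header row, one sample per line, one feature per column.
sample_csv = io.StringIO(
    "1.123456789012345,-2.987654321098765\n"
    "0.000000123456789,3.141592653589793\n"
)

X = pd.read_csv(sample_csv, header=None)

# Keep 4 decimal places; choose the value that suits your algorithm.
X_rounded = X.round(decimals=4)
print(X_rounded)
```

For a real file, replace sample_csv with the path to the extracted CSV. The number of decimals is a trade-off between runtime and fidelity to the original data.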