Abstract

The dataset contains Gaussian blobs with varying numbers of samples, centers, and features. The number of samples ranges from 500 to 50,000, the number of centers from 2 to 100, and the number of features from 2 to 2048. These sets of Gaussian blobs can be used to test the scalability and effectiveness of clustering algorithms. Each compressed set contains two kinds of files: files ending with "_X.csv" hold the datapoints, while files ending with "_y.csv" hold the corresponding class data.

The filename of each Gaussian blob inside the compressed sets summarizes the blob. For example, the file "s50000_c50_f2048_X.csv" contains 50,000 samples of data that have 2048 dimensions (features) with 50 centers, and the file "s50000_c50_f2048_y.csv" holds the associated class data for "s50000_c50_f2048_X.csv". The blob files are organized by their number of samples. For example, the compressed file "10,000 datapoints set.zip" contains a collection of Gaussian blobs with 10,000 samples each and a varying number of centers and features. The documentation section has a PDF document that lists the files inside each compressed file.

The naming convention of the files uses the following letters to describe the content of the respective file.

s represents number of samples

c represents number of centers

f represents number of features
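
The naming convention above can be decoded programmatically. The following is a minimal Python sketch, assuming every blob file follows the pattern of the example name "s50000_c50_f2048_X.csv"; the names BlobFile, parse_blob_filename, and label_filename are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example name "s50000_c50_f2048_X.csv".
# This is an assumption; check the documentation PDF for the actual file list.
_BLOB_NAME = re.compile(r"^s(\d+)_c(\d+)_f(\d+)_(X|y)\.csv$")


class BlobFile(NamedTuple):
    samples: int   # value after "s"
    centers: int   # value after "c"
    features: int  # value after "f"
    kind: str      # "X" for datapoints, "y" for class data


def parse_blob_filename(name: str) -> BlobFile:
    """Split a blob filename into its sample, center, and feature counts."""
    match = _BLOB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a blob filename: {name!r}")
    samples, centers, features, kind = match.groups()
    return BlobFile(int(samples), int(centers), int(features), kind)


def label_filename(x_name: str) -> str:
    """Return the "_y.csv" filename that pairs with a given "_X.csv" file."""
    info = parse_blob_filename(x_name)
    if info.kind != "X":
        raise ValueError(f"expected a datapoint (_X.csv) file, got {x_name!r}")
    return f"s{info.samples}_c{info.centers}_f{info.features}_y.csv"
```

For instance, parse_blob_filename("s50000_c50_f2048_X.csv") yields the sample, center, and feature counts as integers, and label_filename returns the name of the matching class data file.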

Instructions:

Please go through the documentation file before downloading the compressed zips. The PDF contains a list of the files within each compressed file.

The datapoints are real numbers with up to 15 decimal places. At this precision, an algorithm may take a long time to converge. If you need to round the numbers, you can do so on a pandas DataFrame with DataFrameName.round(decimals=decimal_place).
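
The rounding step can be sketched in Python with pandas as follows. The helper name round_datapoints, the column names, the sample values, and the default of 6 decimal places are assumptions chosen for illustration; in practice the DataFrame would be loaded from one of the "_X.csv" files.

```python
import pandas as pd


def round_datapoints(frame: pd.DataFrame, decimal_place: int = 6) -> pd.DataFrame:
    """Return a copy of the datapoints rounded to the given number of decimals."""
    # DataFrame.round returns a new DataFrame and leaves the input unchanged.
    return frame.round(decimals=decimal_place)


# Small in-memory stand-in for a datapoint file (values are made up).
sample = pd.DataFrame(
    {
        "f0": [0.123456789012345, -1.987654321098765],
        "f1": [3.141592653589793, 2.718281828459045],
    }
)

rounded = round_datapoints(sample, decimal_place=4)
print(rounded)
```

Whether rounding noticeably shortens convergence time depends on the algorithm and the data, so it is worth comparing clustering results before and after rounding.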
