A Batch of Integer Data Sets for Clustering Algorithms

Citation Author(s):
Nuno Paulino, INESC TEC
Submitted by:
Nuno Paulino
Last updated:
Tue, 05/17/2022 - 22:17
DOI:
10.21227/smta-vv06

Abstract 

This is a simple batch of data sets of points containing only integer attributes. The data sets were generated with a randomly correlated data set generator (DOI: 10.13140/RG.2.2.34866.43200).

This batch includes a total of 12 data sets, which can be used to validate implementations of algorithms such as k-means clustering or k-nearest neighbours.

Instructions: 

# A Batch of Integer Data Sets for Clustering Algorithms

## Description

This is a simple batch of data sets of points containing only integer attributes.

The data sets were generated with a randomly correlated data set generator (DOI:10.13140/RG.2.2.34866.43200).

This batch includes a total of 12 data sets, for all possible combinations of the following parameter ranges:

 

- N (number of points) = {32k, 64k}

- K (number of clusters) = {8, 16}

- D (number of attributes) = {2, 8, 16}
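The grid above yields 2 × 2 × 3 = 12 combinations. As a minimal sketch, the expected file names can be enumerated in Python. It assumes that every file follows the `N32_K8_D2.txt` pattern used in this batch, with N written in thousands; the helper name `dataset_names` is hypothetical.

```python
from itertools import product

# Parameter values from the list above; "32k"/"64k" are assumed to
# appear in file names as 32 and 64.
N_VALUES = (32, 64)
K_VALUES = (8, 16)
D_VALUES = (2, 8, 16)


def dataset_names():
    """Return the 12 expected file names, following the N32_K8_D2.txt pattern."""
    return [
        f"N{n}_K{k}_D{d}.txt"
        for n, k, d in product(N_VALUES, K_VALUES, D_VALUES)
    ]


if __name__ == "__main__":
    for name in dataset_names():
        print(name)
```

Comparing this list against the contents of the download directory is a quick way to check that no file is missing.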

 

The data sets are named accordingly, e.g., *N32_K8_D2.txt*. Also included are secondary files with one instance of possible initial centroid values per data set, computed by a run of the *k-means++* algorithm. These values were employed in the research paper referred to at the end of this README.
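Starting from fixed initial centroids makes a k-means run deterministic, so two implementations can be compared result by result. The following is a minimal Python sketch of Lloyd's algorithm under that setup. It is not the implementation from the research paper: the function names are hypothetical, and updating centroids as floating-point means is an assumption (an integer-only implementation may round differently).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lowest index)."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm started from fixed initial centroids.

    Assumption: centroids are updated as floating-point means.
    """
    centroids = [list(map(float, c)) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

This pure-Python version favours readability over speed; for the 32k and 64k point sets, a vectorized or compiled implementation would be the practical choice, with this sketch serving as a reference for small checks.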

Data set files are formatted such that each row is a complete datum, with attributes separated by **whitespace**.
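A file in this layout can be read with a few lines of Python. The sketch below assumes that every row holds the same number of integer attributes and that blank lines, if any, can be skipped; neither `parse_points` nor `load_dataset` is part of the batch, and whether the centroid files share this layout is not stated here.

```python
def parse_points(lines):
    """Parse whitespace-separated integer rows into a list of tuples."""
    points = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # skip blank lines (an assumption about the files)
        point = tuple(int(field) for field in fields)
        if points and len(point) != len(points[0]):
            raise ValueError(
                f"line {line_no}: expected {len(points[0])} attributes, "
                f"got {len(point)}"
            )
        points.append(point)
    return points


def load_dataset(path):
    """Read one data set file, e.g. N32_K8_D2.txt, into a list of tuples."""
    with open(path, encoding="utf-8") as handle:
        return parse_points(handle)
```

For example, `load_dataset("N32_K8_D2.txt")` should return tuples of length 2, one per point, if the file matches its name.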

 

## Copyright

Copyright 2020 SPeCS.
 
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

 

## Related Research Papers

Paulino et al., 2020, *Optimizing OpenCL Code for Performance on FPGA: k-means Case Study with Integer Data Sets*, IEEE Access

 

## Authors

Nuno Paulino - [ResearchGate](https://www.researchgate.net/profile/Nuno_Paulino2)