A Batch of Integer Data Sets for Clustering Algorithms
- Submitted by: Nuno Paulino
- Last updated: Tue, 05/17/2022 - 22:17
- DOI: 10.21227/smta-vv06
Abstract
This is a simple batch of data sets of points containing only integer attributes. The data sets were generated with a randomly correlated data set generator (DOI: 10.13140/RG.2.2.34866.43200).
This batch includes a total of 12 data sets, which can be used to validate implementations of clustering algorithms such as k-means, or of related distance-based methods such as k-nearest neighbours.
# A Batch of Integer Data Sets for Clustering Algorithms
# Description
This batch includes a total of 12 data sets, one for each combination of the following parameter values (2 × 2 × 3 = 12):
- N (number of points) = {32k, 64k}
- K (number of clusters) = {8, 16}
- D (number of attributes) = {2, 8, 16}
The data sets are named accordingly, e.g., *N32_K8_D2.txt*. Each data set is accompanied by a secondary file holding one possible set of initial centroid values, computed by a run of the *k-means++* algorithm. These values were employed in the research paper referred to at the end of this README.
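As a quick way to check that a download is complete, the 12 expected file names can be enumerated from the parameter values. This is a minimal sketch that assumes every file follows the pattern of the one example name given above (*N32_K8_D2.txt*); the function name `dataset_names` is hypothetical, and the names of the secondary centroid files are not covered because this README does not state them.

```python
from itertools import product

# Parameter values as listed in this README. N is kept in thousands, as it
# appears in the example file name (the README does not say whether "32k"
# means 32000 or 32768 points).
N_VALUES = (32, 64)      # number of points, in thousands
K_VALUES = (8, 16)       # number of clusters
D_VALUES = (2, 8, 16)    # number of attributes


def dataset_names():
    """Return one file name per (N, K, D) combination.

    Assumption: all files follow the pattern of the single example name
    given in the README, N32_K8_D2.txt.
    """
    return [
        f"N{n}_K{k}_D{d}.txt"
        for n, k, d in product(N_VALUES, K_VALUES, D_VALUES)
    ]


if __name__ == "__main__":
    for name in dataset_names():
        print(name)
```

Comparing this list against a directory listing shows at a glance whether any data set file is missing.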
Data set files are formatted such that each row is a complete datum, where attributes are separated by **whitespace**.
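The row format described above can be read with the Python standard library alone. The sketch below parses the rows into integer tuples and then runs a plain Lloyd-style k-means from a caller-supplied list of initial centroids. All function names are hypothetical. Two points are assumptions rather than statements about the related paper: centroids are updated with integer floor division so that they stay integer, and the initial centroids are passed in as a list because the layout of the secondary centroid files is not described in this README.

```python
def parse_points(text):
    """Parse whitespace-separated integer rows into a list of tuples."""
    points = []
    for line in text.splitlines():
        fields = line.split()        # splits on any run of whitespace
        if not fields:               # skip blank lines
            continue
        points.append(tuple(int(f) for f in fields))
    return points


def load_points(path):
    """Read a data set file from disk and parse it."""
    with open(path, "r") as handle:
        return parse_points(handle.read())


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length integer tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the lowest index)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute centroids as integer (floor) means; empty clusters keep their centroid."""
    dims = len(centroids[0])
    sums = [[0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, label in zip(points, labels):
        counts[label] += 1
        for i, value in enumerate(p):
            sums[label][i] += value
    return [
        tuple(s // counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(len(centroids))
    ]


def kmeans(points, initial_centroids, max_iter=100):
    """Iterate assignment and update steps until the centroids stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        updated = update(points, assign(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    # Tiny inline sample in the same row format (D = 2), not taken from the batch.
    sample = "0 0\n0 2\n10 10\n10 12\n"
    print(kmeans(parse_points(sample), [(0, 0), (10, 10)]))
```

The `max_iter` bound is there because truncating the means to integers could, in principle, keep the centroids from settling. To compare against another implementation, use the same initial centroids and the same rounding rule in both.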
## Copyright
Copyright 2020 SPeCS.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
## Related Research Papers
Paulino et al., 2020, *Optimizing OpenCL Code for Performance on FPGA: k-means Case Study with Integer Data Sets*, IEEE Access
## Authors
Nuno Paulino - [ResearchGate](https://www.researchgate.net/profile/Nuno_Paulino2)