This batch contains simple data sets of points with integer-only attributes. The data sets were generated with a randomly correlated data set generator (DOI: 10.13140/RG.2.2.34866.43200).

This batch includes a total of 12 data sets, which can be used to validate implementations of algorithms such as k-nearest neighbours or k-means.
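As an illustration of the kind of implementation such data sets could validate, here is a minimal k-means (Lloyd's algorithm) sketch on integer points. The toy points and the choice of initial centroids are assumptions made for this example and are not taken from the batch itself.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, labels


# Toy integer points forming two well-separated groups (made up for this sketch).
points = [(0, 0), (0, 2), (2, 0), (10, 10), (10, 12), (12, 10)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

Fixing the initial centroids keeps the run deterministic, which makes it easier to compare the result against a reference implementation.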

73 Views

In this paper, we present a collaborative recommender system that recommends elective courses to students based on similarities between the grades they obtained in the last semester. The proposed system employs data mining techniques to discover patterns between grades. To this end, students are clustered into groups with similar grades. The data set is processed for clustering in such a way that it produces an optimal number of clusters.

241 Views

 

 

Instructions: 

The .zip file contains 6 folders when unzipped. We provide the details of each folder below.

 

“Proteins” folder: Contains 20 protein targets organized into two folders (Benchmark and CASP) depending on the family each target belongs to. Data for each protein is provided in a subfolder named with its id. Each such subfolder contains the following 4 files.

  1. A .fasta file containing the amino-acid sequence of the protein.

  2. A .pdb file containing the native tertiary structure coordinates. The detailed .pdb file format is documented at http://www.wwpdb.org/documentation/file-format

  3. A .frag3 file containing the fragments of length 3 for the protein sequence generated from http://old.robetta.org/

  4. A .frag9 file containing the fragments of length 9 for the protein sequence generated from http://old.robetta.org/

 

“Generation” folder: Contains the generated ensembles for the protein targets in 20 subfolders, one for each target, named with their ids. Each subfolder contains 5 files, each containing the generated ensemble for one run. Each such file contains 14 columns, and each row represents one generated structure. The first column provides the Rosetta score4 energy, the second column provides the lRMSD to the native structure, and each of the remaining 12 columns provides one USR feature for the structure.
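The 14-column layout described above could be read along the following lines. This is a minimal sketch that assumes whitespace-separated numeric columns and no header row, neither of which is stated in the description; the function names and the sample row are invented for illustration.

```python
def parse_generation_line(line):
    """Split one row into energy, lRMSD, and the 12 USR features."""
    fields = [float(x) for x in line.split()]  # assumes whitespace-separated values
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, got {len(fields)}")
    return {"energy": fields[0], "lrmsd": fields[1], "usr": fields[2:]}


def parse_generation(lines):
    """Parse every non-empty row of a generated-ensemble file."""
    return [parse_generation_line(line) for line in lines if line.strip()]


# Made-up row: energy, lRMSD, then 12 placeholder USR values.
sample = ["-35.2 12.4 " + " ".join(str(i) for i in range(1, 13))]
structures = parse_generation(sample)
```

With a real file, the same function could be called on an open file object, for example `parse_generation(open(path))`, where `path` points to one of the five run files of a target.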

 

“Reduced” folder: Contains the reduced ensembles for each clustering technique in separate folders. Each such folder contains 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.

 

“Truncation” folder: Contains the reduced ensembles via truncation for the protein targets in 20 subfolders, one for each target, named with their ids. Each such subfolder contains 5 files, each containing the reduced ensemble for one run. Each such file contains 2 columns and each row represents one structure in the reduced ensemble. The first column provides the Rosetta score4 energy and the second column provides the lRMSD to the native structure.
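For the two-column reduced-ensemble files, summary statistics of the lRMSD column can be sketched as below. The whitespace-separated layout, the use of the population standard deviation, and the sample rows are assumptions made for this example.

```python
from statistics import mean, pstdev


def lrmsd_summary(lines):
    """Return (minimum, mean, population std dev) of the lRMSD column."""
    # Column 0 is the score4 energy, column 1 is the lRMSD.
    lrmsds = [float(line.split()[1]) for line in lines if line.strip()]
    return min(lrmsds), mean(lrmsds), pstdev(lrmsds)


# Made-up rows: "energy lRMSD".
sample = ["-40.0 3.0", "-38.5 5.0", "-36.0 7.0"]
lowest, average, spread = lrmsd_summary(sample)
```

If the sample standard deviation is wanted instead, `statistics.stdev` can be swapped in for `pstdev`.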

 

“Ks” folder: Contains 4 separate files, one for each clustering technique, containing the number of clusters for each run of each protein target. These files can be used to plot the distributions for the number of clusters.

 

“Bars” folder: Contains 3 separate subfolders with the information needed to plot the bar charts for the minimum, average, and standard deviation of lRMSDs to the native structure for the CASP targets. Each subfolder contains 10 files, one for each target. Each file contains 6 rows that provide the lRMSD value for the original ensemble and for the ensembles reduced by hierarchical clustering, k-means clustering, GMM clustering, gmx-cluster clustering, and truncation, respectively.
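The fixed row order of a "Bars" file can be turned into labelled values as in the sketch below. It assumes each row holds a single numeric value in its first field; the label strings and the sample numbers are chosen for this example only.

```python
# Row order as listed in the folder description.
BAR_LABELS = [
    "original",
    "hierarchical",
    "k-means",
    "GMM",
    "gmx-cluster",
    "truncation",
]


def parse_bar_file(lines):
    """Map the 6 rows of a 'Bars' file to their ensemble labels."""
    values = [float(line.split()[0]) for line in lines if line.strip()]
    if len(values) != len(BAR_LABELS):
        raise ValueError(f"expected {len(BAR_LABELS)} rows, got {len(values)}")
    return dict(zip(BAR_LABELS, values))


# Made-up lRMSD values, one per row.
bars = parse_bar_file(["2.1", "2.4", "2.3", "2.6", "2.5", "3.0"])
```

The resulting dictionary can be passed directly to a plotting library as bar labels and heights.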

80 Views

CUPSNBOTTLES is an object data set, recorded by a mobile service robot. There are 10 object classes, each with a varying number of samples. Additionally, there is a clutter class, containing samples where the object detector failed.

Instructions: 

Download and extract the ZIP file containing all files. Python code is available (under 'scripts') to easily load the data set. Other programming languages should also be able to handle the .jpg, .hdf and .csv files. For easy access with Python, a pickle dump file has been added. It contains no extra information compared to the .csv file.
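For readers who prefer not to use the provided scripts, the .csv file can be read with Python's standard library alone. The sketch below is generic: the column names `filename` and `label` and the sample rows are hypothetical, since the actual header of the data set's .csv file is not given here.

```python
import csv
import io


def read_annotations(fileobj):
    """Read a CSV with a header row into a list of dictionaries."""
    return list(csv.DictReader(fileobj))


# In-memory stand-in for the real file; columns and values are hypothetical.
demo = io.StringIO("filename,label\nimg_0001.jpg,cup\nimg_0002.jpg,bottle\n")
rows = read_annotations(demo)
```

With the extracted data set, the same function could be called as `read_annotations(open(path, newline=""))`, where `path` is the location of the .csv file.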

126 Views

Motor point identification is pivotal for eliciting comfortable and sustained muscle contraction through functional electrical stimulation. For this purpose, anatomical charts and manual search techniques are used to extract subject-specific stimulation profiles. Because such information is heterogeneous, it lacks standardization and reproducibility. To address these limitations, we aim to identify, localize, and characterize the motor points of forearm muscles across nine healthy subjects.

462 Views

Cluster analysis, which focuses on the grouping and categorization of similar elements, is widely used in various fields of research. Inspired by the phenomenon of atomic fission, this paper proposes a novel density-based clustering algorithm called fission clustering (FC). It focuses on mining the dense families of clusters in the dataset and uses the information in the distance matrix to fissure the dataset into subsets.

162 Views