The mean shift (MS) algorithm is a nonparametric method used to cluster sample points and find the local modes of kernel density estimates, based on an iterative gradient-ascent scheme. In this paper we develop a mean-shift-inspired algorithm to estimate the modes of regression functions and partition the sample points in the input space. We prove convergence of the sequences generated by the algorithm and derive non-asymptotic rates of convergence of the estimated local modes for the underlying regression model.
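As background for the mean-shift idea the paper builds on, here is a minimal sketch of classic kernel mean shift with a Gaussian kernel. This is not the paper's regression-mode algorithm; the bandwidth, iteration cap, and stopping rule are illustrative choices.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=50, tol=1e-6):
    """Move every point uphill on a Gaussian KDE until it reaches a mode."""
    modes = points.astype(float).copy()
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, x in enumerate(modes):
            # Gaussian kernel weight of each sample relative to x
            w = np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * bandwidth ** 2))
            shifted[i] = w @ points / w.sum()  # weighted mean = one ascent step
        done = np.abs(shifted - modes).max() < tol
        modes = shifted
        if done:
            break
    return modes

# two well-separated blobs collapse onto their respective modes
pts = np.concatenate([np.zeros((20, 1)), 10.0 + np.zeros((20, 1))])
m = mean_shift(pts, bandwidth=1.0)
```

Each iteration replaces a point by the kernel-weighted mean of all samples, which is exactly a gradient-ascent step on the kernel density estimate up to a data-dependent step size.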


Biomolecular structure data analyzed in "Space Partitioning and Regression Mode Seeking via a Mean-Shift-Inspired Algorithm" by Wanli Qiao and Amarda Shehu.


Silk fibroin is the structural fiber of the silk filament and is usually separated from the external sericin by a chemical process called degumming. This process consists of boiling the silk cocoons in an alkali bath for a set time. It is also known that the degumming process affects the properties of the resulting silk fibroin fibers.


The data contained in the first sheet of the dataset is in tidy format (each row corresponds to an observation) and can be directly imported into R and processed with the tidyverse package. Note that the row with standard order 49 corresponds to the reference degumming, while row 50 corresponds to the test made on the bare silk fiber (not degummed). For this last sample, neither the mass loss nor the secondary structures were determined: since the fiber was not degummed, the sericin surrounding it prevented examination of the secondary structure. The first two columns of the dataset are the Standard order (the standard order in which the Design of Experiments data are processed) and the Run order (the randomized order in which the trials were performed). The next four columns are the studied factors, while the rest of the dataset reports the process yields (in this case, the properties of the resulting silk fibers).

The second sheet contains the molecular weight information for the tested samples. In this case only one sample from each triplicate was tested. Both the standard order and the run order refer to the same samples as in the first sheet.
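The text describes an R/tidyverse workflow; the same tidy-data filtering can be sketched in Python with pandas. The file name and the yield column below are hypothetical, and the numeric values are purely illustrative.

```python
import pandas as pd

# In practice the first sheet would be read with something like
#   df = pd.read_excel("degumming_dataset.xlsx", sheet_name=0)   # hypothetical name
# Mock frame mimicking the documented layout:
df = pd.DataFrame({
    "Standard order": [1, 2, 49, 50],
    "Run order": [2, 4, 1, 3],
    "Mass loss (%)": [27.1, 26.4, 26.8, None],  # illustrative values
})

# drop the reference degumming (std. order 49) and the bare, non-degummed fiber (50)
doe = df[~df["Standard order"].isin([49, 50])]
```

Excluding rows 49 and 50 before modeling keeps only the actual Design of Experiments runs.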


Feature tables and source code for: Camargo et al., "A Machine Learning Strategy for Locomotion Classification and Parameter Estimation Using Fusion of Wearable Sensors," Transactions on Biomedical Engineering, 2021.


The feature tables used for this paper can be found in ‘’ and ‘’, while source code is found in ‘’. To get started, download all the files into a single folder and unzip them. Within ‘CombinedLocClassAndParamEst-master’, the folder ‘sf_analysis’ contains the main code to run, split into ‘Classification’ and ‘Regression’ code folders. There is also a '' file within the source code with more information and dependencies. If you’d like to just regenerate plots and results from the paper, then move all contents of the ‘zz_results_published’ folders (found under the feature table folders) up one folder so they are just within the ‘Classification’ or ‘Regression’ data folders. Go into the source code, find the ‘analysis’ folders, and run any ‘analyze*.m’ script with updated ‘datapath’ variables to point to the results folders you just moved.


This data resource is an outcome of the NSF RAPID project titled "Democratizing Genome Sequence Analysis for COVID-19 Using CloudLab" awarded to University of Missouri-Columbia.

The resource contains the output of variant analysis (along with CADD scores) on human genome sequences obtained from the COVID-19 Data Portal. The variants include single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels).


1. Download a .zip file.

2. Unzip the file and extract it into a folder. 

3. There will be two folders, namely, VCF and CADD_Scores, which contain the compressed .vcf and .tsv files, respectively. The .vcf files are filtered VCF files produced by the GATK Best Practices workflow for RNA-seq data, using the hg19 reference genome. There is also a .xlsx file containing the run accession IDs (e.g., SRR12095153) and the URLs from which the paired-end sequences were downloaded. A complete description of the sequences can be found via these URLs.

4. Check for new .zip files.
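As a minimal sketch of step 3, the compressed .vcf files can be read directly with Python's standard library. The two-variant VCF below is toy data written to a temporary file so the parsing loop is self-contained; the real files in the VCF folder follow the same tab-separated layout.

```python
import gzip
import os
import tempfile

# toy two-variant VCF, gzip-compressed like the files in the archive
toy = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t67890\t.\tTG\tT\t40\tPASS\t.\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.vcf.gz")
with gzip.open(path, "wt") as fh:
    fh.write(toy)

variants = []
with gzip.open(path, "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _, ref, alt = line.split("\t")[:5]
        # single-base REF and ALT -> SNP; differing lengths -> indel
        kind = "SNP" if len(ref) == len(alt) == 1 else "indel"
        variants.append((chrom, int(pos), ref, alt, kind))
```

The same SNP/indel distinction is what separates the two variant classes mentioned in the description above.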


Human neck movement data acquired using a MetaWear CPRO device (accelerometer-based kinematic data). The data were fed to the OpenSim simulation software to extract kinematics and kinetics (muscle and joint forces, accelerations, and positions).


The dataset was collected for the purpose of investigating how brainwave signals can be used for industrial insider threat detection. The dataset was collected using an Emotiv Insight 5-channel device. It contains data from 17 subjects who agreed to participate in this data collection.


The Magnetic Resonance – Computed Tomography (MR-CT) Jordan University Hospital (JUH) dataset was collected after receiving Institutional Review Board (IRB) approval from the hospital, and consent forms were obtained from all patients. All procedures followed are consistent with the ethics of handling patients' data.



Restricted mean survival time (RMST), recommended for reporting survival, lacks a tool for analyzing multilevel factors. Gini's mean difference of RMSTs, Δ, is proposed and applied to compare a lymph node ratio-based classification (LNRc) versus a number-based classification (ypN) in stage II/III breast cancer patients prospectively enrolled in neoadjuvant chemotherapy who underwent axillary dissection. The number of positive nodes (npos) classified patients as ypN0 (npos = 0), ypN1 (npos = 1–3), ypN2 (npos = 4–9), or ypN3 (npos ≥ 10).
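Assuming Δ is the classical Gini mean difference (the mean absolute pairwise difference of the group RMSTs; the paper may use a weighted variant), it can be sketched as follows. The RMST values are illustrative, not from the study.

```python
import itertools

def gini_mean_difference(rmsts):
    """Delta: mean absolute difference over all unordered pairs of group RMSTs."""
    pairs = list(itertools.combinations(rmsts, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# illustrative RMSTs (months) for ypN0..ypN3 -- hypothetical values
delta = gini_mean_difference([60.0, 52.0, 44.0, 30.0])
```

A larger Δ indicates that the classification spreads the groups' restricted mean survival times further apart, i.e., discriminates prognosis better.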


Breast cancer; neoadjuvant chemotherapy.

1 header row.

370 data rows.

Columns = patient characteristics; refer to the papers for detailed descriptions.



Histopathological characterization of colorectal polyps allows clinicians to tailor patient management and follow-up, with the ultimate aim of avoiding or promptly detecting an invasive carcinoma. Colorectal polyp characterization relies on the histological analysis of tissue samples to determine the polyp's malignancy and dysplasia grade. Deep neural networks achieve outstanding accuracy in medical pattern recognition; however, they require large sets of annotated training images.


To load the data, we provide below an example routine for the PyTorch framework. We provide two different resolutions: 800 and 7000 µm/px.

Within each resolution, we provide .csv files containing the metadata for all the included files, comprising:

  • image_id;
  • label (6 classes - HP, NORM, TA.HG, TA.LG, TVA.HG, TVA.LG);
  • type (4 classes - HP, NORM, HG, LG);
  • reference WSI;
  • reference region of interest in WSI (roi);
  • resolution (micron per pixels, mpp);
  • coordinates for the patch (x, y, w, h).
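The metadata table can be loaded with pandas before building the dataloader. The rows below are toy values following the documented schema (the exact column names in the released .csv files may differ slightly):

```python
from io import StringIO

import pandas as pd

# toy metadata rows in the documented schema -- values are illustrative
csv_text = """image_id,label,type,wsi,roi,mpp,x,y,w,h
001.png,TA.HG,HG,wsi_01,roi_0,0.44,0,0,1812,1812
002.png,NORM,NORM,wsi_02,roi_1,0.44,100,0,1812,1812
"""
df = pd.read_csv(StringIO(csv_text))
counts = df["type"].value_counts()  # e.g., class balance across the 4 types
```

The resulting dataframe is exactly what the dataloader below expects as its `df` argument.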

Below you can find the dataloader class of UNITOPatho for PyTorch. More examples can be found here.

import torch
import torchvision
import numpy as np
import cv2
import os


class UNITOPatho(torch.utils.data.Dataset):
    def __init__(self, df, T, path, target, subsample=-1, gray=False, mock=False):
        self.path = path
        self.df = df
        self.T = T
        self.target = target
        self.subsample = subsample
        self.mock = mock
        self.gray = gray

        allowed_target = ['type', 'grade', 'top_label']
        if target not in allowed_target:
            raise ValueError(f'Target must be in {allowed_target}, got {target}')

        print(f'Loaded {len(self.df)} images')

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        entry = self.df.iloc[index]
        image_id = entry.image_id
        image_id = os.path.join(self.path, entry.top_label_name, image_id)

        if self.mock:
            # return a random image instead of reading from disk
            C = 1 if self.gray else 3
            img = np.random.randint(0, 255, (224, 224, C)).astype(np.uint8)
        else:
            img = cv2.imread(image_id)

            if self.subsample != -1:
                # halve the side length until one more halving would
                # undershoot the target size, then resize exactly
                w = img.shape[0]
                while w // 2 > self.subsample:
                    img = cv2.resize(img, (w // 2, w // 2))
                    w = w // 2
                img = cv2.resize(img, (self.subsample, self.subsample))

            if self.gray:
                img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                img = np.expand_dims(img, axis=2)
            else:
                # OpenCV reads BGR; convert to RGB for PyTorch transforms
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        if self.T is not None:
            img = self.T(img)

        return img, entry[self.target]