Silk fibroin is the structural protein of the silk filament, and it is usually separated from the outer sericin coating by a chemical process called degumming. This process consists of an alkali bath in which the silk cocoons are boiled for a set time. The degumming process is also known to affect the properties of the resulting silk fibroin fibers.


The data contained in the first sheet of the dataset are in tidy format (each row corresponds to an observation) and can be imported directly into R and processed with the tidyverse package. Note that the row with standard order 49 corresponds to the reference degumming, while row 50 corresponds to the test made on the bare (not degummed) silk fiber. In this last case neither the mass loss nor the secondary structures were determined: since the fiber was not degummed, the sericin still surrounded it, so the secondary structure could not be examined. The first two columns of the dataset are the standard order (the standard order in which the Design of Experiments data are elaborated) and the run order (the randomized order in which the trials were performed). The next four columns are the studied factors, while the rest of the dataset reports the process yields (in this case, the properties of the resulting silk fibers).
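As a minimal sketch of working with this layout (shown in Python with pandas rather than R, and with hypothetical column names, since the actual headers are defined in the sheet), the reference and bare-fiber rows can be separated from the designed runs like this:

```python
import pandas as pd

# Mock stand-in for the first sheet; in practice you would load it with
# pd.read_excel(path, sheet_name=0). Column names here are hypothetical.
df = pd.DataFrame({
    "std_order": [1, 2, 49, 50],
    "run_order": [3, 1, 4, 2],
    "mass_loss": [27.1, 26.5, 26.9, None],  # not determined for the bare fiber
})

doe_runs = df[df["std_order"] <= 48]    # the designed trials
reference = df[df["std_order"] == 49]   # reference degumming
bare_fiber = df[df["std_order"] == 50]  # bare fiber, not degummed
```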

The second sheet contains information on the molecular weight of the tested samples. In this case only one sample per triplicate was tested. Both the standard order and the run order refer to the same samples as in the first sheet.

In the Raw file the raw mechanical curves are reported in OriginLab format, divided into datasheets numbered by sample from 1 to 48, with the addition of a datasheet for the reference curves obtained from the Rockwood protocol and a datasheet for the curves from the raw cocoons.

The same archive also contains a file with the GPC curves and their elaborations for the tested samples.


Feature tables and source code for Camargo et al. A Machine Learning Strategy for Locomotion Classification and Parameter Estimation using Fusion of Wearable Sensors. Transactions on Biomedical Engineering. 2021


The feature tables used for this paper can be found in ‘’ and ‘’, while source code is found in ‘’. To get started, download all the files into a single folder and unzip them. Within ‘CombinedLocClassAndParamEst-master’, the folder ‘sf_analysis’ contains the main code to run, split into ‘Classification’ and ‘Regression’ code folders. There is also a '' file within the source code with more information and dependencies. If you’d like to just regenerate plots and results from the paper, then move all contents of the ‘zz_results_published’ folders (found under the feature table folders) up one folder so they are just within the ‘Classification’ or ‘Regression’ data folders. Go into the source code, find the ‘analysis’ folders, and run any ‘analyze*.m’ script with updated ‘datapath’ variables to point to the results folders you just moved.
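The folder shuffling described above can be sketched in Python (a hedged example: the folder names follow the description, and the demonstration runs on a throwaway mock tree rather than the real download):

```python
import shutil
import tempfile
from pathlib import Path

def promote_results(data_folder: Path) -> None:
    """Move everything in <data_folder>/zz_results_published up into <data_folder>."""
    published = data_folder / "zz_results_published"
    for item in published.iterdir():
        shutil.move(str(item), str(data_folder / item.name))
    published.rmdir()

# Demonstration on a throwaway mock tree; in practice data_folder would be
# the 'Classification' or 'Regression' folder under a feature-table folder.
root = Path(tempfile.mkdtemp())
mock = root / "Classification" / "zz_results_published"
mock.mkdir(parents=True)
(mock / "results.mat").touch()

promote_results(root / "Classification")
print((root / "Classification" / "results.mat").exists())  # True
```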


This data resource is an outcome of the NSF RAPID project titled "Democratizing Genome Sequence Analysis for COVID-19 Using CloudLab" awarded to University of Missouri-Columbia.

The resource contains the output of variant analysis (along with CADD scores) on human genome sequences obtained from the COVID-19 Data Portal. The variants include single nucleotide polymorphisms (SNPs) and short insertions and deletions (indels).


1. Download a .zip file.

2. Unzip the file and extract it into a folder. 

3. There will be two folders, namely, VCF and CADD_Scores. These folders contain the compressed .vcf and .tsv files, respectively. The .vcf files are filtered VCF files produced by the GATK best-practices workflow for RNA-seq data, using the hg19 reference genome. There is also a .xlsx file containing the run accession IDs (e.g., SRR12095153) and the URLs from which the paired-end sequences were downloaded. A complete description of the sequences can be found via these URLs.

4. Check for new .zip files.
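As a minimal sketch of reading the extracted variant files (the VCF record below is invented for illustration; the real compressed files can be opened the same way with gzip.open(path, "rt")):

```python
import io

# A tiny invented VCF fragment standing in for one of the .vcf files.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t14653\t.\tC\tT\t100.0\tPASS\t.\n"
)

records = []
for line in io.StringIO(vcf_text):
    if line.startswith("#"):
        continue  # skip meta-information lines and the column header
    chrom, pos, _id, ref, alt, *_ = line.rstrip("\n").split("\t")
    records.append((chrom, int(pos), ref, alt))

print(records)  # [('chr1', 14653, 'C', 'T')]
```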


Human neck movement data acquired using the MetaWear CPRO device (accelerometer-based kinematic data). The data were fed to the OpenSim simulation software to extract kinematics and kinetics (muscle and joint forces, accelerations, and positions).


The dataset was collected for the purpose of investigating how brainwave signals can be used for industrial insider threat detection. The dataset was collected using an Emotiv Insight 5-channel device, and it contains data from 17 subjects who agreed to participate in the data collection.


The Magnetic Resonance – Computed Tomography (MR-CT) Jordan University Hospital (JUH) dataset has been collected after receiving Institutional Review Board (IRB) approval of the hospital and consent forms have been obtained from all patients. All procedures followed are consistent with the ethics of handling patients’ data.




From manuscript: Lymph Node Ratio after Neoadjuvant Chemotherapy for Stage II/III Breast Cancer: Prognostic Value Measured with Gini’s Mean Difference of Restricted Mean Survival Times.

Bhumsuk Keam, Olena Gorobets, Vincent Vinh-Hung, Seock-Ah Im.


Breast cancer; neoadjuvant chemotherapy.

1 header row.

370 data rows.

Columns = patient characteristics; refer to the paper for detailed descriptions.



Histopathological characterization of colorectal polyps makes it possible to tailor patients' management and follow-up, with the ultimate aim of avoiding or promptly detecting an invasive carcinoma. Colorectal polyp characterization relies on the histological analysis of tissue samples to determine the malignancy and dysplasia grade of the polyps. Deep neural networks achieve outstanding accuracy in medical pattern recognition; however, they require large sets of annotated training images.


In order to load the data, we provide below an example routine working within the PyTorch framework. We provide two different resolutions, 800 and 7000 um/px.

Within each resolution, we provide .csv files containing all the metadata for the included files, comprising:

  • image_id;
  • label (6 classes - HP, NORM, TA.HG, TA.LG, TVA.HG, TVA.LG);
  • type (4 classes - HP, NORM, HG, LG);
  • reference WSI;
  • reference region of interest in WSI (roi);
  • resolution (micron per pixels, mpp);
  • coordinates for the patch (x, y, w, h).
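A minimal sketch of reading such a metadata table with pandas (the two rows below are invented; the column names follow the list above, with the patch coordinates as separate x, y, w, h columns, and the mpp values are placeholders):

```python
import io
import pandas as pd

# Invented two-row excerpt mirroring the metadata fields listed above.
csv_text = """image_id,label,type,wsi,roi,mpp,x,y,w,h
patch_0001.png,HP,HP,wsi_001,roi_01,4.41,0,0,1812,1812
patch_0002.png,TA.HG,HG,wsi_002,roi_03,4.41,1812,0,1812,1812
"""

df = pd.read_csv(io.StringIO(csv_text))
hg_patches = df[df["type"] == "HG"]  # e.g., select high-grade patches
print(len(hg_patches))  # 1
```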

Below you can find the dataloader class of UNITOPatho for PyTorch. More examples can be found here.

import torch
import torchvision
import numpy as np
import cv2
import os


class UNITOPatho(torch.utils.data.Dataset):
    def __init__(self, df, T, path, target, subsample=-1, gray=False, mock=False):
        self.path = path
        self.df = df
        self.T = T
        self.target = target
        self.subsample = subsample
        self.mock = mock
        self.gray = gray

        allowed_target = ['type', 'grade', 'top_label']
        if target not in allowed_target:
            raise ValueError(f'Target must be in {allowed_target}, got {target}')

        print(f'Loaded {len(self.df)} images')

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        entry = self.df.iloc[index]
        image_id = entry.image_id
        image_id = os.path.join(self.path, entry.top_label_name, image_id)

        if self.mock:
            # Generate a random image instead of reading from disk
            C = 1 if self.gray else 3
            img = np.random.randint(0, 255, (224, 224, C)).astype(np.uint8)
        else:
            img = cv2.imread(image_id)

            if self.subsample != -1:
                # Halve the image until the next halving would undershoot,
                # then resize to the requested size
                w = img.shape[0]
                while w // 2 > self.subsample:
                    img = cv2.resize(img, (w // 2, w // 2))
                    w = w // 2
                img = cv2.resize(img, (self.subsample, self.subsample))

            if self.gray:
                img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                img = np.expand_dims(img, axis=2)
            else:
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        if self.T is not None:
            img = self.T(img)

        return img, entry[self.target]


This study presents six datasets for DNA/RNA sequence alignment with one of the most common alignment algorithms, namely, the Needleman–Wunsch (NW) algorithm. The research proposes a fast, parallel implementation of the NW algorithm using machine learning techniques, and is an extension and improved version of our previous work. The current implementation achieves 99.7% accuracy using a multilayer perceptron with the ADAM optimizer, and up to 2912 giga cell updates per second on two real DNA sequences of length 4.1 M nucleotides.
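For reference, a minimal serial, score-only sketch of the classic NW dynamic program that the machine-learning implementation accelerates, with illustrative scoring parameters (match +1, mismatch -1, gap -1):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # H[i][j] = best score for aligning a[:i] against b[:j]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag,            # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,  # gap in b
                          H[i][j - 1] + gap)  # gap in a
    return H[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # 0
```

Each cell of the matrix is one "cell update"; the giga-cell-updates-per-second figure above counts how many such cells are evaluated per second.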