Human Proteome and Peptides with up to 2 missed cleavages.
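
As a rough illustration of what "up to 2 missed cleavages" means, the sketch below digests a protein sequence in silico. The source does not name the enzyme, so the trypsin rule used here (cleave after K or R, except before P) is an assumption, and the function name `digest` is hypothetical.

```python
import re

def digest(sequence, max_missed=2):
    """Return peptides from an in-silico digest with up to `max_missed` missed cleavages.

    Assumption: trypsin-like specificity (cleave after K or R unless followed by P).
    """
    # Positions right after each cleavable residue.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Peptide boundaries: sequence start, internal cleavage sites, sequence end.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining j - i adjacent fragments corresponds to j - i - 1 missed cleavages.
        last = min(i + 1 + max_missed, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides

if __name__ == "__main__":
    for peptide in digest("AKRPGKC", max_missed=2):
        print(peptide)
```

With `max_missed=0` only the fully cleaved fragments are returned; each additional allowed missed cleavage adds the peptides that span one more uncut site.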


We crawled a large number of biomedical articles from PubMed to evaluate the keyphrase extraction system.

The articles, each consisting of a title, an abstract, and author-provided keyphrases, are used for the experiments.

In our paper, cancer-related biomedical articles are selected.


Each document in the dataset consists of a title, an abstract, and keyphrases provided by the authors.


To develop a non-invasive, machine learning-based assessment tool that supports timely, accurate diagnosis in the elderly, we created an annotated dataset of 668 tongue images collected from hospitalized geriatric patients in a tertiary hospital in Shanghai, China. Images were captured via a light-field camera using the CIELAB color space (to simulate human visual perception) and were then manually labeled by a panel of subject matter experts after a chart review of the patients’ clinical information documented in the hospital’s information system.




Specific subject area

Diagnosis – Image and text data analysis

Hospitalized geriatric patients are a highly heterogeneous group often with variable diseases and conditions. Physicians, and geriatricians especially, are devoted to seeking non-invasive testing tools to support a timely, accurate diagnosis. Chinese tongue diagnosis, mainly based on the color and texture of the tongue, offers a unique solution.

Type of data

Free-text document



Each patient has a folder containing 1 face image, 1 tongue image, and 2 narrative documents. An additional summary table is provided.

How data were acquired

We used a patented light-field camera (CN201520303463.5), called the intelligent mirror, which uses the CIE L*a*b* color space. Data acquisition was standardized as far as possible (i.e., by ensuring consistent sitting height and placement of the intelligent mirror).

Data format

The face and tongue images are raw data. They were taken at 600 pixels per inch (about 42.3 µm per pixel) and saved as *.jpg files with minimal compression (at most 10%). One narrative document is annotated and contains the parameters generated by the intelligent mirror when creating the face and tongue images; the other contains the annotation results from the expert panel (e.g., vital signs, clinical imaging examinations, and laboratory indicators).
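
The pixel pitch quoted above follows directly from the resolution, since one inch is exactly 25.4 mm. The short sketch below reproduces that arithmetic; the function names and the example image dimensions are hypothetical and are not taken from the dataset.

```python
MICROMETRES_PER_INCH = 25_400.0  # 1 inch = 25.4 mm exactly

def pixel_pitch_um(pixels_per_inch):
    """Physical size of one pixel, in micrometres."""
    return MICROMETRES_PER_INCH / pixels_per_inch

def physical_size_mm(width_px, height_px, pixels_per_inch):
    """Physical extent of an image in millimetres at the given resolution."""
    pitch_mm = pixel_pitch_um(pixels_per_inch) / 1000.0
    return width_px * pitch_mm, height_px * pitch_mm

if __name__ == "__main__":
    print(f"{pixel_pitch_um(600):.1f} um per pixel")
    # Hypothetical 1200 x 1800 pixel image, for illustration only.
    print(physical_size_mm(1200, 1800, 600))
```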

Parameters for data collection

The study was conducted at a Chinese tertiary, comprehensive hospital. We recruited hospitalized subjects (excluding minority groups or other sensitive or disempowered populations) in the Geriatrics Department beginning on January 1, 2019. Images were captured via a light-field camera using the CIELAB color space (to simulate human visual perception) and were then manually labeled by a panel of subject matter experts after a chart review of the patients’ clinical information documented in the hospital’s information system.

Description of data collection

Data acquisition and image annotation were conducted by subject matter experts, including four fully credentialed senior-level physicians (i.e., associate chief physician and above), one resident, and two medical students. One medical student was in charge of data acquisition. The resident consolidated each patient’s chronic medical history, clinical imaging examinations, and laboratory indicators. One physician diagnosed the patients’ constitutional types. Another physician gave a final admission diagnosis by considering the patient’s constitution based on both traditional Chinese medicine (TCM) and Western medicine. Constitutional types are based on TCM analysis and differentiation of pathological conditions in accordance with the eight principal syndromes, namely 八纲辨证, including yin and yang (阴阳), exterior and interior (表里), cold and heat (寒热), and hypofunction and hyperfunction (虚实). All the information from the free-text data labeling was documented digitally in Chinese by one medical student and translated into English. The treatment plan corresponding to the admission diagnosis was reviewed and annotated by the remaining two physicians.

A total of 12 items must be merged into each annotated document, including various indices related to tongue diagnosis, physical or mental factors, clinicians’ observations, and more. To reduce this annotation effort, we used a previously designed algorithm to generate templates automatically. Under the K-means paradigm, the algorithm (1) embedded each annotated document of the first 200 patients into a vector representation, (2) partitioned those vectors into several (e.g., K=10) clusters, and (3) designated each cluster representative as a prototype template, i.e., the real annotated document whose vector is closest to the cluster centroid. For the remaining 468 patients, we used the resulting prototype templates to assist with the annotation.
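
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not say how documents were embedded or how K-means was initialized, so the bag-of-words embedding, the farthest-first initialization, the toy documents, and all function names are hypothetical.

```python
from collections import Counter

def embed(docs):
    """Step 1 (assumed embedding): bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(vectors, k):
    """Deterministic initialization (assumption): repeatedly pick the farthest vector."""
    chosen = [list(vectors[0])]
    while len(chosen) < k:
        nxt = max(vectors, key=lambda v: min(sq_dist(v, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen

def kmeans(vectors, k, max_iters=100):
    """Step 2: partition the vectors into k clusters with Lloyd's algorithm."""
    centroids = farthest_first(vectors, k)
    assign = None
    for _ in range(max_iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_assign == assign:  # converged: assignments no longer change
            break
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def prototype_indices(vectors, centroids, assign):
    """Step 3: for each cluster, the index of the real document closest to its centroid."""
    protos = []
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            protos.append(min(members, key=lambda i: sq_dist(vectors[i], centroid)))
    return protos

if __name__ == "__main__":
    docs = [  # toy stand-ins for annotated documents
        "tongue pale coating thin",
        "tongue pale coating thin moist",
        "tongue pale coating",
        "pulse rapid fever high",
        "pulse rapid fever",
        "pulse rapid fever high sweat",
    ]
    vectors = embed(docs)
    centroids, assign = kmeans(vectors, k=2)
    for i in prototype_indices(vectors, centroids, assign):
        print("prototype template:", docs[i])
```

Choosing a real document rather than the centroid itself keeps every template a valid, human-readable annotation, which is what makes it usable as a starting point for the remaining patients.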

Data source location

Shanghai, CHN

Cambridge, MA, USA


Schematic explanation of the representations of brain networks during WM tasks. The upper left panel illustrates the locations of the four fitted sources. Panels A-E present WM-related components in terms of specific neurocognitive processes. A. During this period, selective attention is activated by capitals’ trigger, which induces the attention mechanism in the PPC. B.


This dataset was used in the article "Dias-Audibert FL, Navarro LC, de Oliveira DN, Delafiori J, Melo CFOR, Guerreiro TM, Rosa FT, Petenuci DL, Watanabe MAE, Velloso LA, Rocha AR and Catharino RR (2020) Combining Machine Learning and Metabolomics to Identify Weight Gain Biomarkers. Front. Bioeng. Biotechnol. 8:6. doi: 10.3389/fbioe.2020.00006", open access available at:


The WGMSML-Data folder contains the mass spectra input data for the MATLAB scripts, which are in the WGMSML-MATLAB-SourceCode folder. The WGMSML-ExecutionLogsAndPlots folder contains the logs and plots generated by running the MATLAB code on the input data. The main scripts are numbered in order of execution.


This dataset contains in-silico results of insulin treatment using a fully automated artificial pancreas algorithm based on reinforcement learning for FDA-approved virtual patients (C. D. Man et al., 2014) with type 1 diabetes (10 adults and 10 adolescents). 


This database contains the results of an experiment in which healthy subjects played 5 trials of a rehabilitation-based VR game, experiencing either difficulty variations or presence variations.

The collected results are demographic information, emotional states after each trial, and electrophysiological signals recorded during all 5 trials.


Paper is one of the materials commonly used in electronics applications. It is flexible, cheap, widely available, and allows for simple manufacturing when paired with methods such as screen printing or inkjet printing. Proposed below is an optogenetic device that uses paper as the sole substrate, with a screen-printed PCB and Ag/AgCl wires. The device was quick and easy to manufacture, unlike state-of-the-art optoelectronic devices, which use polymers and rely on complex fabrication methods such as photolithography.


Research Article


This database contains the 166 Galvanic Skin Response (GSR) signal registers collected from the subjects participating in the first experiment (EXP 1) presented in:

R. Martinez, A. Salazar-Ramirez, A. Arruti, E. Irigoyen, J. I. Martin and J. Muguerza, "A Self-Paced Relaxation Response Detection System Based on Galvanic Skin Response Analysis," in IEEE Access, vol. 7, pp. 43730-43741, 2019. doi: 10.1109/ACCESS.2019.2908445


* GSR signals of each participant: The files whose names begin with the letter A contain the GSR registers extracted from the participants. Each file has a single column, which holds the values of the GSR signal sampled at Fs = 1 Hz.
* Labels of each signal: The files whose names begin with LABEL contain the RResp labels of each subject. Each file has two columns. The first column is the label of the register and the second column is the timestamp for that label. The registers were labeled using 20 s windows (sliding every 5 s), with each label positioned at the center of its window. For example:

-1 12.5  --> In the time window from 2.5 s to 22.5 s, the RResp label is -1, with the center of the window at 12.5 s.

There are four RResp intensity levels: 0 stands for the absence of any RResp, -1 for a low-intensity RResp, -2 for a medium-intensity RResp, and -3 for a high-intensity RResp.
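
A minimal sketch of how a LABEL file might be read and each label mapped back to its 20 s window. It assumes the two columns are whitespace-separated, which the description does not state, so the delimiter may need adjusting. The names `parse_labels` and `INTENSITY` are hypothetical.

```python
# Meaning of each RResp label value, as described above.
INTENSITY = {0: "none", -1: "low", -2: "medium", -3: "high"}

def parse_labels(text, window_s=20.0):
    """Parse 'label timestamp' lines into (label, window_start_s, window_end_s) tuples.

    Assumption: columns are separated by whitespace. Each timestamp marks the
    center of its window, so the window spans +/- window_s / 2 around it.
    """
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label = int(float(parts[0]))
        center = float(parts[1])
        records.append((label, center - window_s / 2, center + window_s / 2))
    return records

if __name__ == "__main__":
    sample = "-1 12.5\n0 17.5\n"  # illustrative values, not taken from the dataset
    for label, start, end in parse_labels(sample):
        print(f"{start:5.1f} s to {end:5.1f} s: RResp {INTENSITY[label]}")
```

Because the windows slide every 5 s, consecutive windows overlap by 15 s, so a single GSR sample generally falls inside several labeled windows.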