CSV | IEEE DataPort

Big Data Machine Learning Benchmark on Spark

We introduce a benchmark of distributed algorithm execution over big data. The datasets consist of metrics describing the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster, measured with system-resource-agnostic indicators: CPU consumption, memory usage, operating system process load, network traffic, and I/O operations. The metrics were collected every five seconds for each algorithm at five different data volume scales, totaling 275 distinct datasets.
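The total of 275 is consistent with one dataset per combination of algorithm, data volume scale, and indicator. The listing does not state this breakdown, so the sketch below treats it as an assumption; the indicator labels are illustrative names, not identifiers from the dataset.

```python
# Assumption: one dataset per (algorithm, data scale, indicator) combination.
# The description gives only the counts and the total of 275.

def count_datasets(num_algorithms, num_scales, indicators):
    """Return the number of (algorithm, scale, indicator) combinations."""
    return num_algorithms * num_scales * len(indicators)

# Hypothetical labels for the five indicators named in the description.
INDICATORS = [
    "cpu_consumption",
    "memory_usage",
    "os_process_load",
    "network_traffic",
    "io_operations",
]

total = count_datasets(num_algorithms=11, num_scales=5, indicators=INDICATORS)
print(total)
```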

Categories:

Learning deep representations for video-based intake gesture detection

Video dataset of 102 participants for the paper "Learning deep representations for video-based intake gesture detection"

Categories:

Health

DISE DELHI PRIMARY TO UPPER-PRIMARY LEVEL SCHOOLS IN ACADEMIC SESSION 2011-2012

Archival bundle of District Information System for Education (DISE) data for Delhi primary to upper-primary level schools in academic session 2012-2013. DISE is a school-level dataset covering government-recognized schools. It is a joint initiative of the Government of India, UNICEF, and the National University of Educational Planning and Administration (NUEPA).

Categories:

Education

Simulated Boiler Fault Data

MATLAB Simulink was used to develop an emulator for the Viessmann Vitorond 200 Gas Fired Boiler VD2 Series 380. A series of faults was modeled, along with normal data across the expected range of operation, to create a labelled dataset of approximately 27,500 cases for training and testing boiler fault classification models.

Categories:

Energy

Python algorithms and dataset of empirical line method applied to inland water hyperspectral images combining reference targets and in situ water measurements

Empirical line methods (ELM) are frequently used to correct images from aerial remote sensing. Remote sensing of aquatic environments captures only a small amount of energy because the water absorbs much of it. As a result, the signal response of water is proportionally smaller than that of other land surface targets.

This dataset presents some resources and results of a new approach to calibrating empirical lines that combines reference calibration panels with water samples. We optimize the method using Python algorithms until it reaches the best result.
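For readers unfamiliar with the technique, a generic empirical line fit can be sketched as a least-squares line relating raw sensor values to known reflectances, one line per spectral band. This is a minimal illustration of the standard method, not the authors' optimized algorithm; the function names and sample values are hypothetical.

```python
# Generic empirical line method (ELM) for a single spectral band:
# fit reflectance = gain * dn + offset from calibration targets,
# then apply the line to image pixel values.
# All sample values below are made up for illustration.

def fit_empirical_line(dn, reflectance):
    """Ordinary least-squares fit; returns (gain, offset)."""
    if len(dn) != len(reflectance) or len(dn) < 2:
        raise ValueError("need at least two paired calibration points")
    n = len(dn)
    mean_x = sum(dn) / n
    mean_y = sum(reflectance) / n
    sxx = sum((x - mean_x) ** 2 for x in dn)
    if sxx == 0:
        raise ValueError("calibration targets must differ in sensor value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dn, reflectance))
    gain = sxy / sxx
    offset = mean_y - gain * mean_x
    return gain, offset

def apply_empirical_line(dn_values, gain, offset):
    """Convert raw sensor values to estimated surface reflectance."""
    return [gain * v + offset for v in dn_values]

# Hypothetical calibration points: dark panel, bright panel, and a water
# sample with an in situ reflectance measurement.
panel_dn = [100.0, 500.0, 900.0]
panel_reflectance = [0.05, 0.25, 0.45]

gain, offset = fit_empirical_line(panel_dn, panel_reflectance)
corrected = apply_empirical_line([150.0, 300.0], gain, offset)
```

Including low-reflectance water samples among the calibration points, as the dataset description suggests, constrains the line at the dark end of the range where water pixels fall.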

Categories:

A distributed Fog node assessment model by using Fuzzy rules learned by XGBoost

The dataset is used in the paper entitled "A distributed Fog node assessment model by using Fuzzy rules learned by XGBoost" and contains the fuzzy rules extracted by XGBoost.

Categories:

Artificiality

This is an artificial imbalanced dataset.

Categories:

Standards Research Data

A distributed Front-end Edge node assessment model by using Fuzzy and a learning-to-rank method-machine learning

The dataset is used by the machine learning method in the paper "A distributed Front-end Edge node assessment model by using Fuzzy and a learning-to-rank method".

Categories:

A distributed Front-end Edge node assessment model by using Fuzzy and a learning-to-rank method

This dataset is related to the paper "A distributed Front-end Edge node assessment method by using a learning-to-rank method".

Categories:

Industrial Machines Dataset for Electrical Load Disaggregation

This dataset contains heavy-machinery data from the Brazilian industrial sector. The data was collected in a poultry feed factory located in the state of Minas Gerais, Brazil. Its process can be summarized as producing pellets of poultry feed from corn or soybeans with added nutrients. The factory produces at full scale throughout the year, so it has well-behaved usage patterns at any time. It operates Monday through Friday (and occasionally on Saturdays, when production is below the monthly target) on a daily three-shift schedule from 10:00 PM to 05:00 PM.

Categories: