Big Data Analytics

Testing Results from Manuscript "Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis"

This dataset contains the testing results presented in the manuscript "Exploring the Potential of Offline LLMs in Data Science: A Study on Code Generation for Data Analysis". It is intended for assessing the capabilities of offline LLMs in code generation for data analytics tasks. The dataset is best used after a thorough reading of the manuscript. A total of 250 testing results were generated for each of the two LLMs evaluated, and these were merged to create the current dataset.

Categories:

Artificial Intelligence

geospatial vector data used in HiVQ

The dataset includes geospatial vector point and linestring data, ranging in size from 4 million to 100 million records, for evaluating the applicability of HiVQ.

The Surface Accelerations Reference

The Surface Accelerations Reference is a catalog of all longitudinal and lateral accelerations experienced by SHRP2-NDS participants. The Strategic Highway Research Program Naturalistic Driving Study (SHRP2-NDS) is the largest naturalistic driving study in the world, comprising 34.5 million miles of recorded driving data. To create the Surface Accelerations Reference, every acceleration event in SHRP2-NDS was detected, summarized, and recorded, creating a database of more than 1.7 billion data points.

Categories:

Transportation

A 24-hour signal recording dataset with labels for cybersecurity and IoT

The dataset contains:
1. A 24-hour recording of ADS-B signals at DAB on 1090 MHz with a USRP B210 (8 MHz sample rate). In total, signals from more than 130 aircraft were captured.
2. An enhanced gr-adsb, in which each message's digital baseband (I/Q) signals and metadata (flight information) are recorded simultaneously. The output file path can be specified in the property panel of the ADS-B decoder submodule.
3. Our GNU Radio flow graph for signal reception.
4. MATLAB code for the paper on wireless device identification using the zero-bias neural network.

CO2 dataset

We obtained 6 million instances for analyzing and modelling CO2 behavior. The data logger and sensor nodes acquire a reading every 1 second.
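
As a rough sense of scale, the instance count and the 1-second acquisition period together imply a logging duration. The sketch below assumes each instance is a single reading from a single node and that all nodes sample in parallel; the number of nodes is not stated in the description, so `n_nodes` is a hypothetical parameter, and `recording_days` is an illustrative helper rather than part of the dataset.

```python
def recording_days(n_instances: int, period_s: float = 1.0, n_nodes: int = 1) -> float:
    """Days of logging implied by n_instances readings spread evenly over n_nodes.

    Assumption (not from the dataset description): every node samples once
    per period_s seconds, and each instance is one reading from one node.
    """
    seconds = n_instances * period_s / n_nodes
    return seconds / 86_400  # seconds per day


# One node logging every second would need roughly 69 days to reach 6 million readings.
single_node_days = recording_days(6_000_000)

# With a hypothetical four nodes sampling in parallel, the same count takes about 17 days.
four_node_days = recording_days(6_000_000, n_nodes=4)

print(f"{single_node_days:.1f} days (1 node), {four_node_days:.1f} days (4 nodes)")
```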

Big Data Machine Learning Benchmark on Spark

We introduce a benchmark for the execution of distributed algorithms over big data. The datasets consist of metrics on the computational impact (resource usage) of eleven well-known machine learning techniques on a real computational cluster, expressed through system-resource-agnostic indicators: CPU consumption, memory usage, operating system process load, network traffic, and I/O operations. The metrics were collected every five seconds for each algorithm at five different data volume scales, totaling 275 distinct datasets.
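
One plausible reading of the 275 figure is one dataset per combination of technique, data volume scale, and indicator (11 x 5 x 5). The description does not state this breakdown explicitly, so the sketch below is an assumption, and the technique and scale names are hypothetical placeholders.

```python
from itertools import product

# Hypothetical placeholder names: the description counts eleven techniques
# and five scales but does not name them.
techniques = [f"technique_{i:02d}" for i in range(1, 12)]
scales = [f"scale_{i}" for i in range(1, 6)]

# The five indicators listed in the description.
indicators = [
    "cpu_consumption",
    "memory_usage",
    "os_process_load",
    "network_traffic",
    "io_operations",
]

# Assumption: one dataset per (technique, scale, indicator) combination.
catalog = list(product(techniques, scales, indicators))
dataset_count = len(catalog)

print(dataset_count)
```

Under this assumption the count matches the stated total; a different split (for example, one file per technique and scale with five repeated runs) would give the same number, so the enumeration is illustrative only.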

Mechanical Parts data cost data and shape cluster

Dataset I: To estimate part prices from manufacturing characteristics and other manufacturing processes, a feature-quantity representation is applied. Manufacturing features are identified and their feature quantities calculated, and the results are recorded as assigned values in the data. A widely used, mature deep-learning method is then adopted to produce accurate part quotations.
