The advent of the Industrial Internet of Things (IIoT) has led to the availability of huge amounts of data that can be used to train advanced Machine Learning algorithms to perform tasks such as Anomaly Detection, Fault Classification and Predictive Maintenance. Most industrial equipment is already capable of logging warnings and alarms occurring during operation. Turning this data, which is easy to collect, into meaningful information about the health state of machinery can substantially improve efficiency and up-time. The provided dataset consists of a sequence of alarms logged by packaging equipment in an industrial environment. The collection includes data logged by 20 machines, deployed in different plants around the world, from 2019-02-21 to 2020-06-17. There are 154 distinct alarm codes, whose distribution is highly unbalanced.


In this dataset, we provide both raw and processed data. As for raw data, raw/alarms.csv is a comma-separated file with a row for each logged alarm. Each row provides the alarm code, the timestamp of occurrence, and the identifier of the piece of equipment generating the alarm. From this file, it is possible to generate data for tasks such as those described in the abstract. For the sake of completeness, we also provide the Python code to process the data and generate input and output sequences that can be used to predict which alarms will occur in a future time window, given the sequence of all alarms that occurred in a previous time window. The resulting processed files are also provided (processed/all_alarms.pickle, processed/all_alarms.json, and processed/all_alarms.npz). In particular, the function create_dataset in the provided Python module creates sequences already split into train and test sets and stores them in a pickle file. It is also possible to use create_dataset_json and create_dataset_npz to obtain different output formats for the processed dataset. The ready-to-use datasets provided in the zipped folder were created with an input window of 1720 minutes and an output window of 480 minutes. More information can be found in the attached file.
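The windowing idea described above can be sketched as follows. This is a minimal illustration, not the provided create_dataset function: the name `make_windows`, the tuple layout of the input, and the stride (one output window) are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def make_windows(alarms, input_minutes=1720, output_minutes=480):
    """Build (input_sequence, output_alarm_codes) pairs for one machine.

    `alarms` is a list of (timestamp, alarm_code) tuples sorted by time.
    The stride used here (one output window) is an assumption; the
    dataset's own module may slide the windows differently.
    """
    if not alarms:
        return []
    in_len = timedelta(minutes=input_minutes)
    out_len = timedelta(minutes=output_minutes)
    start, end = alarms[0][0], alarms[-1][0]
    samples = []
    while start + in_len + out_len <= end:
        split = start + in_len
        stop = split + out_len
        # Input: ordered alarm codes seen in the input window.
        x = [code for ts, code in alarms if start <= ts < split]
        # Output: distinct alarm codes seen in the following output window.
        y = sorted({code for ts, code in alarms if split <= ts < stop})
        samples.append((x, y))
        start += out_len
    return samples
```

Applied per machine, this yields one training sample per window position; the default window lengths match the 1720 and 480 minutes stated above.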



This is a repository of 102 smart home conflict scenarios that were designated as conflicts by actual human users. In other words, humans consider these scenarios to be conflicts in a smart home environment. To see how to use this repository, and how the repository was collected, please read the following paper:


Each conflict scenario is an English sentence that can be processed with NLP techniques or converted into features.


Vehicular networks have various characteristics that can help identify the relationships between vehicles. When two vehicles are moving at a certain speed and distance from each other, it is important to know about their communication capability. Vehicles can communicate only while they are within each other's communication range. Given previous data of a road segment, our dataset can be used to identify the compatibility time between two selected vehicles. The compatibility time is defined as the time during which two vehicles will be within the communication range of each other.


Each row contains characteristic information related to two vehicles at time t. The features (column headings) are as follows:


- Euclidean Distance: The shortest distance between two vehicles in meters

- Relative Velocity: The velocity of the 2nd vehicle as seen from the 1st vehicle

- Direction Difference: Given the direction information of each vehicle, the direction difference feature gives the angle between the directions in which the two vehicles are moving. For instance, two vehicles going the same way on the same road have a direction difference of 0, whereas two vehicles moving in opposite directions have a difference of 180. We calculated the direction difference using: |((Direction of i - Direction of j + 180) % 360 - 180)|.

- Direction Difference Label: To ease the process for the supervised learning model, we also included a direction difference label that identifies three possible cases (0 if difference < 60, 2 if difference > 120, and 1 otherwise)

- Tendency: The Tendency label is required to tell whether two vehicles moving in opposite directions are approaching each other or moving away from each other.
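The direction difference formula and its label can be written directly in Python. This is a small sketch: the function names are illustrative, and it relies on Python's `%` operator returning a non-negative result for a positive modulus.

```python
def direction_difference(dir_i, dir_j):
    """Angle between two headings in degrees, in the range 0 to 180."""
    return abs((dir_i - dir_j + 180) % 360 - 180)


def direction_label(diff):
    """0 if diff < 60, 2 if diff > 120, otherwise 1."""
    if diff < 60:
        return 0
    if diff > 120:
        return 2
    return 1
```

For example, headings of 350 and 10 degrees differ by 20 degrees and get label 0, while headings of 0 and 180 degrees differ by 180 degrees and get label 2.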


Target Label (Compatibility time): Our goal is to identify how long two vehicles will be in the communication range of each other. The compatibility time label takes one of five possible values:

L0 means Compatibility Time is 0

L1 means Compatibility Time is more than 2 seconds but less than 5 seconds

L2 means Compatibility Time is more than 5 seconds but less than 10 seconds

L3 means Compatibility Time is more than 10 seconds but less than 15 seconds


L4 means Compatibility Time is more than 15 seconds 
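As a sketch, the five labels above can be expressed as a lookup on the compatibility time in seconds. The function name is illustrative. The ranges are taken literally from the list; values the list does not cover (for example times between 0 and 2 seconds, or exactly 5, 10 or 15 seconds) return None here rather than guessing how the dataset treats them.

```python
def compatibility_label(seconds):
    """Map a compatibility time in seconds to one of L0..L4.

    Returns None for values that the label definitions do not cover.
    """
    if seconds == 0:
        return "L0"
    if 2 < seconds < 5:
        return "L1"
    if 5 < seconds < 10:
        return "L2"
    if 10 < seconds < 15:
        return "L3"
    if seconds > 15:
        return "L4"
    return None
```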


The Real-World Multimodal Foodlog (RWMF) database is built for evaluating multimodal retrieval algorithms in real-life dietary environments. It has 7500 multimodal pairs in total, where each image can be related to multiple texts and each text can be related to multiple images. Details of this database can be found in this paper: Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang, RWMF: Real-World Multimodal Foodlog Database, ICPR 2020


Since this is a multimodal database, the images in RWMF are related to texts by sharing the same tag, which is saved in `Foodhealth/im_label`

* `Foodlog`: the real-world food images and the associated instant bio-data
** `Image`: the folder that contains all the real-world foodlog images.
** `biodata.csv`: the csv file that contains all the associated instant bio-data; these data are associated with food images by the file names of the images.
** `biodata.txt`: the txt file that describes the attributes of each column in `biodata.csv`.
** `data_category.csv`: the health category tags that help test the cross-modal retrieval performance of a model.
** `data_category.txt`: the txt file that describes the attributes of each column in `data_category.csv`.

* `Foodhealth`: the food description texts and the associated food nutrition composition data
** `description.csv`: the csv file that contains all the food description texts referring to each tag.
** `description.txt`: the txt file that describes the attributes of each column in `description.csv`.
** `composition.csv`: the csv file that contains all the food nutrition composition data referring to each tag.
** `composition.txt`: the txt file that describes the attributes of each column in `composition.csv`.
** `im_label.csv`: the csv file that contains all the tags related to each image.
** `im_label.txt`: the txt file that describes the attributes of each column in `im_label.csv`.
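Because images and texts are linked only through a shared tag, the multimodal pairs can be recovered with a simple join on that tag. The sketch below works on in-memory (name, tag) tuples, since the actual column layout is documented in the accompanying `.txt` files and is not assumed here. The function name and the sample values are hypothetical.

```python
from collections import defaultdict


def pair_by_tag(image_tags, text_tags):
    """Return every (image, text) pair that shares a tag.

    image_tags: iterable of (image_name, tag) tuples.
    text_tags:  iterable of (text, tag) tuples.
    """
    texts_by_tag = defaultdict(list)
    for text, tag in text_tags:
        texts_by_tag[tag].append(text)
    pairs = []
    for image, tag in image_tags:
        # One image may match several texts, and one text several images.
        for text in texts_by_tag.get(tag, []):
            pairs.append((image, text))
    return pairs


# Hypothetical sample rows, for illustration only.
images = [("img_001.jpg", "rice"), ("img_002.jpg", "rice"), ("img_003.jpg", "apple")]
texts = [("steamed white rice", "rice"), ("fried rice", "rice"), ("fresh apple", "apple")]
pairs = pair_by_tag(images, texts)
```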


The following data set is modelled after the implementers’ test data in 3GPP TS 33.501 “Security architecture and procedures for 5G System” with the same terminology. The data set corresponds to SUCI (Subscription Concealed Identifier) computation in the 5G UE (User Equipment) for IMSI (International Mobile Subscriber Identity) based SUPI (Subscription Permanent Identifier) and ECIES Profile A. The MCC|MNC part of the IMSI is '274012'.

In the 5G system, the globally unique 5G subscription permanent identifier is called SUPI, as defined in 3GPP TS 23.501. For privacy reasons, the SUPI from 5G devices should not be transferred in clear text, and is instead concealed inside the privacy-preserving SUCI. Consequently, the SUCI protects the privacy of the SUPI over the air interface of the 5G radio network. For SUCIs containing an IMSI-based SUPI, the UE in essence conceals the MSIN (Mobile Subscriber Identification Number) part of the IMSI. On the 5G operator side, the SIDF (Subscription Identifier De-concealing Function) of the UDM (Unified Data Management) is responsible for de-concealing the SUCI and resolves the SUPI from the SUCI based on the protection scheme used to generate the SUCI.

The SUCI protection scheme used in this data set is ECIES Profile A. The scheme output consists of a 256-bit public key, a 64-bit MAC and a 40-bit encrypted MSIN. The SUCI scheme-input MSIN is coded as hexadecimal digits using packed BCD coding, where the order of digits within an octet is the same as the order of digits in the MSIN. As the MSINs have an odd number of digits, bits 5 to 8 of the final octet are coded as ‘1111’.
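The packed BCD coding of the scheme input can be illustrated with a short sketch. It assumes that the first digit of each pair goes into bits 1 to 4 (the low nibble) and the second into bits 5 to 8 (the high nibble), which is the layout implied by the filler ‘1111’ sitting in bits 5 to 8 of the final octet. The function name and the sample MSIN are illustrative and not taken from the data set.

```python
def msin_to_packed_bcd(msin):
    """Encode a decimal MSIN string as packed BCD, returned as a hex string.

    Assumed layout: first digit of each pair in bits 1-4 (low nibble),
    second digit in bits 5-8 (high nibble). An odd-length MSIN is padded
    with the filler 1111 in bits 5-8 of the final octet.
    """
    digits = [int(d) for d in msin]
    if len(digits) % 2:
        digits.append(0xF)  # filler nibble for an odd number of digits
    octets = bytearray()
    for low, high in zip(digits[0::2], digits[1::2]):
        octets.append((high << 4) | low)
    return octets.hex()


# Illustrative 9-digit MSIN; yields 5 octets (40 bits).
encoded = msin_to_packed_bcd("001002086")
```

A 9-digit MSIN encodes to 5 octets, which is consistent with the 40-bit encrypted MSIN mentioned above.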

# Example Python code to load data into Spark DataFrame

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.format("csv").option("inferSchema","true").option("header","true").option("sep",",").load("5g_suci_using_ecies_profile_a_100k.gz")


Vibration measurements on a SAG mill drive motor, for energy harvesting or predictive maintenance


Presented here is a dataset used for our SCADA cybersecurity research. The dataset was built using our SCADA system testbed described in our paper below [*]. The purpose of our testbed was to emulate real-world industrial systems closely. It allowed us to carry out realistic cyber-attacks.



The provided dataset is cleaned, pre-processed, and ready to use. Users may modify it as they wish, but please cite the dataset as below.

M. A. Teixeira, M. Zolanvari, R. Jain, "WUSTL-IIOT-2018 Dataset for ICS (SCADA) Cybersecurity Research," 2018. [Online]. Available:


We took advantage of the prototype to compare the performance of an LwM2M device management protocol implementation and FIWARE’s Ultralight 2.0. In addition to demonstrating the viability of the proposed approach, the obtained results show that each protocol has advantages and disadvantages relative to the other.


The Message Queuing Telemetry Transport (MQTT) protocol is one of the most recent standards used in Internet of Things (IoT) machine-to-machine communication. The increase in the number of available IoT devices and protocols in use reinforces the need for new and robust Intrusion Detection Systems (IDS). However, building an IoT IDS requires datasets to process, train and evaluate the underlying models. The dataset presented in this paper is the first to simulate an MQTT-based network. The dataset is generated using a simulated MQTT network architecture.


The dataset consists of 5 pcap files: normal.pcap, sparta.pcap, scan_A.pcap, mqtt_bruteforce.pcap and scan_sU.pcap. Each file is a recording of one scenario: normal operation, Sparta SSH brute-force, aggressive scan, MQTT brute-force and UDP scan, respectively. The attack pcap files also contain background normal operations. The attacker IP address is “”. Basic packet features are extracted from the pcap files into CSV files with the same names as the pcap files. The features include flags, length, MQTT message parameters, etc. Unidirectional and bidirectional flow features are then extracted. Note that for the bidirectional flows, some features (marked with *) have two values, one for the forward flow and one for the backward flow. The two values are recorded separately and distinguished by the prefix “fwd_” for forward and “bwd_” for backward.
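The idea of paired forward and backward features can be sketched as follows. This is not the extraction code used for the dataset. The packet fields, the feature names `num_pkts` and `mean_pkt_len`, and the rule that the first packet of a flow defines the forward direction are all assumptions made for illustration; only the “fwd_” and “bwd_” prefixes come from the description above.

```python
def bidirectional_features(packets):
    """Aggregate packets into bidirectional flows with fwd_/bwd_ features.

    packets: iterable of dicts with keys src, sport, dst, dport, proto, length,
    given in capture order. The sender of the first packet of a flow is
    treated as the forward side (an assumption of this sketch).
    """
    flows = {}
    for p in packets:
        a = (p["src"], p["sport"])
        b = (p["dst"], p["dport"])
        # Direction-independent key so both directions land in the same flow.
        key = (p["proto"],) + tuple(sorted([a, b]))
        flow = flows.setdefault(key, {"fwd_src": a, "fwd": [], "bwd": []})
        direction = "fwd" if a == flow["fwd_src"] else "bwd"
        flow[direction].append(p["length"])
    rows = []
    for flow in flows.values():
        row = {}
        for prefix in ("fwd", "bwd"):
            lengths = flow[prefix]
            row[f"{prefix}_num_pkts"] = len(lengths)
            row[f"{prefix}_mean_pkt_len"] = sum(lengths) / len(lengths) if lengths else 0.0
        rows.append(row)
    return rows


# Hypothetical packets using documentation-only addresses.
sample = [
    {"src": "192.0.2.1", "sport": 50000, "dst": "192.0.2.2", "dport": 1883, "proto": "tcp", "length": 100},
    {"src": "192.0.2.2", "sport": 1883, "dst": "192.0.2.1", "dport": 50000, "proto": "tcp", "length": 60},
    {"src": "192.0.2.1", "sport": 50000, "dst": "192.0.2.2", "dport": 1883, "proto": "tcp", "length": 200},
]
rows = bidirectional_features(sample)
```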



The demo data set consists of the propagation path distances of the AT&T North America network topology. The geographical node positions (latitude and longitude) and the adjacency matrix were obtained from the Internet Topology Zoo, and the data set was built from this data. The set has been used in the joint placement problem of controller and hypervisor instances in a vSDN-enabled 5G network.
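One way to derive path distances from node coordinates and an adjacency matrix is sketched below. The description does not state how the distances were actually computed, so the choices here are assumptions: great-circle (haversine) length for each link, shortest paths via Floyd-Warshall, and a mean Earth radius of 6371 km. The function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def path_distances(positions, adjacency):
    """All-pairs shortest path distances in km (Floyd-Warshall).

    positions: list of (latitude, longitude) tuples, one per node.
    adjacency: square matrix, nonzero where two nodes share a link.
    """
    n = len(positions)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(n):
            if i != j and adjacency[i][j]:
                dist[i][j] = haversine_km(*positions[i], *positions[j])
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

Unreachable node pairs keep a distance of infinity, which makes disconnected parts of a topology easy to spot.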