This dataset was captured from a Mirai-type botnet attack on an emulated IoT network in OpenStack. Detailed information on the dataset is given in the following work. Please cite it when you use this dataset in your research.
Kalupahana Liyanage Kushan Sudheera, Dinil Mon Divakaran, Rhishi Pratap Singh, and Mohan Gurusamy, "ADEPT: Detection and Identification of Correlated Attack-Stages in IoT Networks," in IEEE Internet of Things Journal.
The dataset contains:
1. A 24-hour recording of ADS-B signals at DAB on 1090 MHz with a USRP B210 (8 MHz sample rate). In total, we captured signals from more than 130 aircraft.
2. An enhanced gr-adsb, in which each message's digital baseband (I/Q) signals and metadata (flight information) are recorded simultaneously. The output file path can be specified in the property panel of the ADS-B decoder submodule.
3. Our GnuRadio flow for signal reception.
4. Matlab code for the paper on wireless device identification using the zero-bias neural network.
1. The "main.m" file in the Matlab code is the entry point of the simulation.
2. The "csv2mat" is a C++ program that converts the raw records (adsb_records1.zip) of our gr-adsb into a format that Matlab can read. The Matio library (https://github.com/tbeu/matio) is required.
3. The Gnuradio flowgraph is also provided with the enhanced version of gr-adsb, which is meant to replace the original one (https://github.com/mhostetter/gr-adsb). You can specify an output file path in the property panel of the ADS-B decoder submodule.
4. Related publication: Zero-Bias Deep Learning for Accurate Identification of Internet of Things (IoT) Devices, IEEE IoTJ (accepted for publication on 21 August 2020), DOI: 10.1109/JIOT.2020.3018677
This work contains data gathered by a series of sensors (PM 10, PM 2.5, temperature, relative humidity, and pressure) in the city of Turin in the north part of Italy (more precisely, at coordinates 45.041903N, 7.625850E). The data has been collected for a period of 5 months, from October 2018 to February 2019. The scope of the study was to address the calibration of low-cost particulate matter sensors and compare the readings against official measures provided by the Italian environmental agency (ARPA Piemonte).
A Densely-Deployed, High Sampling Rate, Open-Source Air Pollution Monitoring WSN
Documentation for the air pollution monitoring station developed at Politecnico di Torino by:
Edoardo Giusto, Mohammad Ghazi Vakili under the supervision of Prof. Bartolomeo Montrucchio.
This section includes a description of our architecture from several points of view, going from the hardware and software architecture, to the communication protocols.
We target the following key characteristics of our system:
- The rapid and easy prototyping capabilities,
- Flexibility in connection scenarios, and
- Low cost combined with dependable components.
As each board has to include a limited number of modules, to facilitate our prototype development, we select the
Raspberry Pi single-board computer as a monitoring board.
Due to our constraints in terms of cost, size and power consumption we select its
Zero Wireless version based on the
The basic operating principle of the system is the following. The data gathered from the sensors are stored in the
MicroSD card of the RPi. At certain time intervals the RPi tries to connect to a
Wi-Fi network and, if such a connection is established, it uploads the newly acquired data to a remote server.
The creation of the Wi-Fi network is achieved using a mobile phone set to operate as personal hot-spot, while on the remote server resides the database storing all the performed measurements.
Wi-Fi connectivity was one of the requirements for the system, but at the same time, the system itself should not produce unnecessary electromagnetic noise that could impact the operation of the host's appliances.
To reduce the time in which the Wi-Fi connection was active, the
Linux OS was set to activate the specific interface at predefined time instants in order to connect to the portable hot-spot.
Once connected to the network, the system performed the following tasks:
- synchronization of the system and RTC clock with a remote Network Time Protocol (NTP) server,
- synchronization of the local samples directory with the remote directory residing on the server.
The latter task is performed using the UNIX rsync utility, which has to be installed on both machines.
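The directory synchronization step can be sketched in Python as follows. This is a minimal sketch, not the project's actual upload script: the function names and the remote destination are hypothetical, and the `-a` (archive) and `-z` (compress) options are one reasonable choice rather than the flags the authors used.

```python
import subprocess
from typing import List


def build_rsync_command(local_dir: str, remote_dir: str) -> List[str]:
    """Build an rsync invocation that mirrors local_dir into remote_dir."""
    # -a recurses and preserves timestamps and permissions; -z compresses in transit.
    # The trailing slash tells rsync to copy the directory's contents,
    # not the directory itself.
    return ["rsync", "-az", local_dir.rstrip("/") + "/", remote_dir]


def upload_samples(local_dir: str, remote_dir: str) -> bool:
    """Run rsync once and report whether it exited successfully."""
    result = subprocess.run(build_rsync_command(local_dir, remote_dir), check=False)
    return result.returncode == 0


# Hypothetical destination; replace with the real server and path.
command = build_rsync_command("/home/alarm/ws/data", "user@example.org:/srv/ws/board01/")
```

Because rsync only transfers files that changed, repeating the call at every connection window keeps the time the Wi-Fi interface is active short.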
To gather data from the sensors, a Python program has been implemented, which runs continuously with a separate process reading from each physical sensor plugged into the board and writing to the MicroSD card.
Note that, since UART communication with the PM sensors had to take place over GPIOs, the pigpiod daemon has been used to create digital serial ports over the Pi's pins.
The directories on the remote server are a simple copy of the MicroSD cards mounted on the boards.
Data in these directories have been inserted in a MySQL database.
Mechanical Design and Hardware Components
In order to easily stack more than one device together, a 3D printed modular case has been designed.
Several enclosing frames can be tied together using nuts and bolts, with the use of a single cap on top.
Figure shows the 3D board design, together with the final sensor and board configurations.
Each platform is equipped with 4 PM sensors (a good trade-off between size and redundancy),
1 Temperature (T) and
Relative Humidity (HT) sensor and
1 Pressure (P) sensor.
As our target was to capture significant data samples for particulate matter, we adopt the following sensors:
Honeywell HPMA115S0-XXX as PM sensor.
As one of our targets was to evaluate these sensors' suitability for air pollution monitoring applications, we insert 4 instances of this sensor in every single platform.
This redundancy allows us to detect strange phenomena and to avoid several kinds of malfunction, making the overall system more stable.
DHT22 as temperature and relative humidity sensor.
This sensor is widespread in prototyping applications, with several open-source implementations of its library publicly available on the internet.
BME280 as a pressure sensor.
This is a cheap but precise barometric pressure and temperature sensor which comes pre-soldered on a small PCB for easy prototyping.
The system also includes a
Real Time Clock (RTC) module for the operating system to retrieve the correct time after a sudden power loss. The chosen device is the
DS3231, which communicates via the I2C interface and has native support in the Linux kernel.
As a last comment, notice that a Printed Circuit Board (PCB) was designed to facilitate connections and soldering of the various sensors and other components.
The database structure can be created using the scripts located in the
mysql_insertion folder of the
mysql -u <user> [-h <host>] [-p] < create_db.sql
Load SQL data (SQL Format)
Data formatted in SQL can be loaded using the mysql command
mysql -u username -p WEATHER_STATION < db_whole_data.sql, and the
db_whole_data.sql is available in the
SQL_data/ folder of the
Load RAW data (CSV)
Data can be loaded using the python script
sql_ins.py available in the
mysql_insertion folder of the
python sql_ins.py <data_folder>
The script assumes the following folder structure:
Each folder contains a set of csv files. The script automatically loads data into the appropriate table and using the correct fields, which are specified as a list of parameters in the script. It is possible to edit the script to load only a subset of the folders.
To replicate the experiments, the user should clone the raspberry pi image into a MicroSD (16-32 GB).
To do this, s/he can issue the command
dd if=/path/to/image of=/path/of/microsd bs=4M on Linux.
The sampling scripts are run by a systemd unit automatically at system startup. The same systemd unit also handles the automatic respawn of the processes if problems occur. The data are stored in the
/home/alarm/ws/data directory, with filenames corresponding to the date of acquisition.
In order to upload these data to a database, it is possible to use the guide contained in the "database" directory.
In order to perform calibration and tests, it is recommended to take a look at the guide contained in the "analysis" directory. A Python class has been implemented to perform calibration of sensors against the ARPA reference ones. The resulting calibration can then be applied to a time window of choice.
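As an illustration of what calibrating a low-cost sensor against a reference can look like, the sketch below fits a least-squares line mapping raw readings to reference readings and then applies it to a window of samples. This is an assumption for illustration only: the function names are hypothetical, and the class in the "analysis" directory may use a different model and interface.

```python
def fit_linear_calibration(raw, reference):
    """Least-squares fit of reference = slope * raw + intercept."""
    n = len(raw)
    if n != len(reference) or n < 2:
        raise ValueError("need two equally long series with at least 2 samples")
    mean_x = sum(raw) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in raw)
    if sxx == 0:
        raise ValueError("raw readings are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(raw, reference))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def apply_calibration(values, slope, intercept):
    """Map raw sensor readings onto the reference scale."""
    return [slope * v + intercept for v in values]


# Made-up readings: the low-cost sensor reads 2 units below the reference.
slope, intercept = fit_linear_calibration([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
calibrated = apply_calibration([15.0, 25.0], slope, intercept)
```

The fit is computed on one time window (where both sensor and reference data exist) and can then be applied to any other window of raw readings.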
The 3D model of the case has been developed using the
SketchUp online software.
The resulting model is split in 5 different parts, each large enough to fit in our
3D printer (Makerbot Replicator 2X).
The model is stackable, meaning that several cases can be put on top of each other, with a single roof piece.
Printed Circuit Board
The PCB has been developed using
KiCad software, in order to create a hat for the RPi0 that connects all the sensors.
WS Analysis library documentation (v0.2)
The aim of this package is to provide fast and easy access to, and analysis of, the Weather Station database. This package is located in the
analysis directory, and it is compatible only with Python 3. Please follow the readme file for more information.
│ ├── Cap_v0_1stpart.skp
│ ├── Cap_v0_2dpart.skp
│ ├── ws_rpzero_noGPS_v1.skp
│ ├── ws_sensors_2d_half_v2.skp
│ └── ws_sensors_half_v2.skp
│ ├── arpa_station.json
│ ├── board.json
│ ├── example.py
│ ├── extract.py
│ ├── out.pdf
│ ├── requirements.txt
│ ├── ws_analysis
│ │ ├── __pycache__
│ │ │ └── ws_analysis.cpython-37.pyc
│ │ ├── rpt.txt
│ │ └── script_offset.py
│ ├── ws_analysis.md
│ ├── ws_analysis.pdf
│ ├── ws_analysis.py
│ └── ws_analysis.pyc
│ ├── db_setup.html
│ ├── db_setup.md
│ ├── db_setup.pdf
│ ├── er_diagram.pdf
│ ├── mysql_insertion
│ │ ├── extract_to_file.py
│ │ ├── remove_duplicate.py
│ │ └── sql_ins.py
│ ├── SQL_Table
│ │ ├── create_db.sql
│ │ ├── create_measure_table.sql
│ │ └── load_data.sql
│ └── SQL_data
│ └── db_whole_data.sql.gz
│ └── WS_v2_output.tar.xz
│ ├── csv
│ │ ├── arpa_retrieve.py
│ │ ├── filemerge.py
│ │ ├── gpx2geohash.py
│ │ ├── parse_csv.py
│ │ └── validation.py
│ └── mpu9250
│ └── gyro.py
This dataset contains the database of the transport block (TB) configurations.
The advent of the Industrial Internet of Things (IIoT) has led to the availability of huge amounts of data, which can be used to train advanced Machine Learning algorithms to perform tasks such as Anomaly Detection, Fault Classification and Predictive Maintenance. Most industrial machines are already capable of logging warnings and alarms occurring during operation. Turning this data, which is easy to collect, into meaningful information about the health state of machinery can have a disruptive impact on the improvement of efficiency and up-time. The provided dataset consists of a sequence of alarms logged by packaging equipment in an industrial environment. The collection includes data logged by 20 machines, deployed in different plants around the world, from 2019-02-21 to 2020-06-17. There are 154 distinct alarm codes, whose distribution is highly unbalanced.
In this dataset, we provide both raw and processed data. As for raw data, raw/alarms.csv is a comma-separated file with a row for each logged alarm. Each row provides the alarm code, the timestamp of occurrence, and the identifier of the piece of equipment generating the alarm. From this file, it is possible to generate data for tasks such as those described in the abstract. For the sake of completeness, we also provide the Python code to process data and generate input and output sequences that can be used to address the task of predicting which alarms will occur in a future time window, given the sequence of all alarms that occurred in a previous time window (processed/all_alarms.pickle, processed/all_alarms.json, and processed/all_alarms.npz). The Python module to process raw data into input/output sequences is dataset.py. In particular, the function create_dataset creates sequences already split into train/test and stored in a pickle file. It is also possible to use create_dataset_json and create_dataset_npz to obtain different output formats for the processed dataset. The ready-to-use datasets provided in the zipped folder were created by considering an input window of 1720 minutes and an output window of 480 minutes. More information can be found in the attached readme.md file.
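The input/output windowing described above can be sketched as follows. This is not the code in dataset.py: the function name `make_windows`, the tuple layout, and the choice to slide the window forward by one output-window length are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def make_windows(alarms, input_minutes=1720, output_minutes=480):
    """Split one machine's alarm log into (input, output) samples.

    alarms: list of (timestamp, alarm_code) tuples sorted by timestamp.
    Returns a list of (input_codes, output_codes), where input_codes is the
    ordered sequence of alarms in the input window and output_codes is the
    sorted set of distinct alarms in the window that follows it.
    """
    if not alarms:
        return []
    in_len = timedelta(minutes=input_minutes)
    out_len = timedelta(minutes=output_minutes)
    samples = []
    start = alarms[0][0]
    last = alarms[-1][0]
    while start + in_len + out_len <= last:
        split = start + in_len
        end = split + out_len
        x = [code for t, code in alarms if start <= t < split]
        y = sorted({code for t, code in alarms if split <= t < end})
        samples.append((x, y))
        start += out_len  # assumed stride: one output window
    return samples


# Tiny made-up log with short windows so the result is easy to follow.
t0 = datetime(2019, 2, 21)
log = [(t0 + timedelta(minutes=m), c) for m, c in [(0, 1), (3, 2), (12, 3), (20, 1)]]
samples = make_windows(log, input_minutes=10, output_minutes=5)
```

With the defaults, each sample pairs 1720 minutes of alarm history with the distinct alarms seen in the following 480 minutes.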
This is a repository of 102 smart home conflict scenarios, which were designated as conflicts by actual human users. In other words, humans consider the scenarios below to be conflicts in a smart home environment. To see how to use this repository, and how the repository was collected, please read the following paper:
Each conflict scenario is a sentence in English that can be processed by NLP or can be converted to some features.
Vehicular networks have various characteristics that can help identify the relationships between vehicles. When two vehicles are moving at a certain speed and distance, it is important to know whether they can communicate, which is possible only within their communication range. Given previous data of a road segment, our dataset can be used to identify the compatibility time between two selected vehicles. The compatibility time is defined as the time two vehicles will be within the communication range of each other.
Note: If you use this dataset, please cite our work. https://ieeexplore.ieee.org/abstract/document/9186099
F. H. Kumbhar and S. Y. Shin, "DT-VAR: Decision Tree Predicted Compatibility based Vehicular Ad-hoc Reliable Routing," in IEEE Wireless Communications Letters, doi: 10.1109/LWC.2020.3021430.
Each row contains characteristic information related to two vehicles at time t. The features (column headings) are as follows:
- Euclidean Distance: The shortest distance between two vehicles in meters
- Relative Velocity: The velocity of the second vehicle as seen from the first vehicle
- Direction Difference: Given the direction information of each vehicle, the direction difference feature identifies the angle between the directions in which the two vehicles are moving. For instance, two vehicles going the same way on the same road can have direction difference 0, whereas two vehicles moving in opposite directions will have a difference of 180. We calculated direction difference using: |((Direction of i - Direction of j + 180) % 360 - 180)|.
- Direction Difference Label: To ease the process for the supervised learning model, we also included direction difference label information by identifying three possible directions (0 if difference < 60, 2 if difference > 120 and 1 if none of the above)
- Tendency: A label that differentiates whether two vehicles moving in opposite directions are approaching each other or moving away from each other.
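The Direction Difference and Direction Difference Label features can be reproduced with a few lines of Python. This is a minimal sketch assuming directions are given in degrees; the function names are hypothetical. It relies on Python's `%` operator returning a non-negative result for a positive modulus, so a port to a language with a sign-preserving remainder would need an extra adjustment.

```python
def direction_difference(direction_i, direction_j):
    """Angle between two headings in degrees, folded into [0, 180]."""
    return abs((direction_i - direction_j + 180) % 360 - 180)


def direction_label(difference):
    """0 for a difference below 60, 2 for a difference above 120, 1 otherwise."""
    if difference < 60:
        return 0
    if difference > 120:
        return 2
    return 1


same_way = direction_difference(10, 350)   # headings 20 degrees apart across north
opposite = direction_difference(0, 180)    # opposite headings
crossing = direction_difference(90, 0)     # perpendicular headings
```

Folding the difference into [0, 180] means that headings of 10 and 350 degrees are treated as nearly parallel rather than 340 degrees apart.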
Target Label (Compatibility time): Our goal is to identify how long two vehicles will be in the communication range of each other. The predicted compatibility time label takes one of five possible values:
L0 means Compatibility Time is 0
L1 means Compatibility Time is more than 2 seconds but less than 5 seconds
L2 means Compatibility Time is more than 5 seconds but less than 10 seconds
L3 means Compatibility Time is more than 10 seconds but less than 15 seconds
L4 means Compatibility Time is more than 15 seconds
The Real-World Multimodal Foodlog (RWMF) database is built for evaluating multimodal retrieval algorithms in real-life dietary environments. It has 7500 multimodal pairs in total, where each image can be related to multiple texts and each text can be related to multiple images. Details of this database can be found in this paper: Pengfei Zhou, Cong Bai, Kaining Ying, Jie Xia, Lixin Huang, RWMF: Real-World Multimodal Foodlog Database, ICPR 2020
Since this is a multimodal database, the images in RWMF are related to texts by sharing the same tag, which is saved in `Foodhealth/im_label`
* `Foodlog`: the real-world food images and the associated instant bio-data
** `Image`: the folder that contains all the real-world foodlog images.
** `biodata.csv`: the csv file that contains all the associated instant bio-data; these data are linked to the food images by the image file names.
** `biodata.txt`: the txt file that describes the attributes of each column in `biodata.csv`.
** `data_category.csv`: the health category tags that help test the model's cross-modal retrieval performance.
** `data_category.txt`: the txt file that describes the attributes of each column in `data_category.csv`.
* `Foodhealth`: the food description texts and the associated food nutrition composition data
** `description.csv`: the csv file that contains all the food description texts referred to by each tag.
** `description.txt`: the txt file that describes the attributes of each column in `description.csv`.
** `composition.csv`: the csv file that contains all the food nutrition composition data referred to by each tag.
** `composition.txt`: the txt file that describes the attributes of each column in `composition.csv`.
** `im_label.csv`: the csv file that contains all the tags related to each image.
** `im_label.txt`: the txt file that describes the attributes of each column in `im_label.csv`.
The following data set is modelled after the implementers’ test data in 3GPP TS 33.501 “Security architecture and procedures for 5G System” with the same terminology. The data set corresponds to SUCI (Subscription Concealed Identifier) computation in the 5G UE (User Equipment) for IMSI (International Mobile Subscriber Identity) based SUPI (Subscription Permanent Identifier) and ECIES Profile A. The MCC|MNC part of each IMSI is '274012'.
In the 5G system, the globally unique 5G subscription permanent identifier is called SUPI as defined in 3GPP TS 23.501. For privacy reasons, the SUPI from the 5G devices should not be transferred in clear text, and is instead concealed inside the privacy preserving SUCI. Consequently, the SUPI is privacy protected over-the-air of the 5G radio network by using the SUCI. For SUCIs containing IMSI based SUPI, the UE in essence conceals the MSIN (Mobile Subscriber Identification Number) part of the IMSI. On the 5G operator-side, the SIDF (Subscription Identifier De-concealing Function) of the UDM (Unified Data Management) is responsible for de-concealment of the SUCI and resolves the SUPI from the SUCI based on the protection scheme used to generate the SUCI.
The SUCI protection scheme used in this data set is ECIES Profile A. The scheme-output consists of a 256-bit public key, a 64-bit MAC and a 40-bit encrypted MSIN. The SUCI scheme-input MSIN is coded as hexadecimal digits using packed BCD coding where the order of digits within an octet is the same as the order of the MSIN. As the MSINs have an odd number of digits, bits 5 to 8 of the final octet are coded as ‘1111’.
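The packed BCD coding of the scheme-input can be sketched in Python as below. The function name and the sample MSIN are illustrative, and the nibble placement is an assumption: the first digit of each pair goes into bits 1 to 4 (the low nibble) and the second into bits 5 to 8, which is consistent with the '1111' filler occupying bits 5 to 8 of the final octet.

```python
def msin_to_packed_bcd(msin: str) -> bytes:
    """Encode a decimal MSIN string as packed BCD, padding odd lengths with 0xF."""
    if not (msin.isascii() and msin.isdigit()):
        raise ValueError("MSIN must be a non-empty string of decimal digits")
    digits = [int(d) for d in msin]
    if len(digits) % 2:
        digits.append(0xF)  # filler goes into bits 5-8 of the final octet
    # Assumed layout: first digit of each pair in bits 1-4, second in bits 5-8.
    return bytes((digits[i + 1] << 4) | digits[i] for i in range(0, len(digits), 2))


# Illustrative 9-digit MSIN (15-digit IMSI minus the 6-digit MCC|MNC).
encoded = msin_to_packed_bcd("001002086")
```

A 9-digit MSIN yields 5 octets, matching the 40-bit encrypted MSIN above; together with the 256-bit public key and 64-bit MAC, the scheme-output totals 360 bits (45 octets).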
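# Example Python code to load data into Spark DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the active session if one already exists
df = spark.read.format("csv").option("inferSchema","true").option("header","true").option("sep",",").load("5g_suci_using_ecies_profile_a_100k.gz")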