MQTT-IoT-IDS2020: MQTT Internet of Things Intrusion Detection Dataset

Citation Author(s):
Hanan Hindy, Abertay University
Christos Tachtatzis, University of Strathclyde
Robert Atkinson, University of Strathclyde
Ethan Bayne, Abertay University
Xavier Bellekens, University of Strathclyde
Submitted by: Hanan Hindy
Last updated: Mon, 08/31/2020 - 12:43
DOI: 10.21227/bhxy-ep04

Abstract 

The Message Queuing Telemetry Transport (MQTT) protocol is one of the most widely used standards in Internet of Things (IoT) machine-to-machine communication. The increase in the number of available IoT devices and protocols in use reinforces the need for new and robust Intrusion Detection Systems (IDS). However, building an IoT IDS requires datasets with which to train and evaluate the models. The dataset presented in this paper is the first to simulate an MQTT-based network. It is generated using a simulated MQTT network architecture comprising twelve sensors, a broker, a simulated camera, and an attacker. Five scenarios are recorded: (1) normal operation, (2) aggressive scan, (3) UDP scan, (4) Sparta SSH brute-force, and (5) MQTT brute-force attack. The raw pcap files are saved, and features are then extracted from them at three abstraction levels: (a) packet features, (b) unidirectional flow features, and (c) bidirectional flow features. The CSV feature files in the dataset are suited to Machine Learning (ML) use, while the raw pcap files allow deeper analysis of MQTT IoT network communication and the associated attacks.
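To illustrate the difference between the unidirectional and bidirectional abstraction levels, here is a minimal Python sketch. It is not the authors' extraction code: the packet tuples, the IP addresses other than the attacker's, and the function names are assumptions made for illustration. A unidirectional flow keeps source and destination distinct, whereas a bidirectional flow treats both directions of a conversation as one flow.

```python
from collections import defaultdict

# Hypothetical packet records (not taken from the dataset):
# (src_ip, src_port, dst_ip, dst_port, protocol, length_in_bytes)
packets = [
    ("192.168.2.10", 51000, "192.168.2.1", 1883, "TCP", 80),
    ("192.168.2.1", 1883, "192.168.2.10", 51000, "TCP", 60),
    ("192.168.2.10", 51000, "192.168.2.1", 1883, "TCP", 90),
    ("192.168.2.5", 40000, "192.168.2.1", 22, "TCP", 120),
]


def unidirectional_key(pkt):
    """Direction matters: A->B and B->A are separate flows."""
    src_ip, src_port, dst_ip, dst_port, proto, _ = pkt
    return (src_ip, src_port, dst_ip, dst_port, proto)


def bidirectional_key(pkt):
    """Direction is ignored: the two endpoints are sorted into a canonical order."""
    src_ip, src_port, dst_ip, dst_port, proto, _ = pkt
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (first, second, proto)


def group(pkts, key_fn):
    """Group packets into flows according to the given key function."""
    flows = defaultdict(list)
    for pkt in pkts:
        flows[key_fn(pkt)].append(pkt)
    return dict(flows)


uni_flows = group(packets, unidirectional_key)
bi_flows = group(packets, bidirectional_key)
print(len(packets), len(uni_flows), len(bi_flows))
```

In this toy input the sensor-to-broker exchange yields two unidirectional flows but only one bidirectional flow.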

Instructions: 

The dataset consists of 5 pcap files: normal.pcap, sparta.pcap, scan_A.pcap, mqtt_bruteforce.pcap and scan_sU.pcap. Each file is a recording of one scenario: normal operation, Sparta SSH brute-force, aggressive scan, MQTT brute-force and UDP scan, respectively. The attack pcap files also contain background normal operation traffic. The attacker IP address is “192.168.2.5”. Basic packet features are extracted from the pcap files into CSV files with the same names as the pcap files. The features include flags, length, MQTT message parameters, etc. Unidirectional and bidirectional flow features are then extracted. Note that for the bidirectional flows, some features (marked with *) have two values: one for the forward flow and one for the backward flow. Both values are recorded and distinguished by the prefix “fwd_” for the forward flow and “bwd_” for the backward flow.
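The “fwd_”/“bwd_” naming can be sketched as follows. This is an illustrative example only, not the authors' extraction code: the dictionary keys, the feature names (num_pkts, mean_pkt_len, is_attack), the broker address and the rule that the first packet defines the forward direction are all assumptions. Only the attacker IP address comes from the description above, and labelling a flow as an attack when the attacker is one of its endpoints is likewise an assumed rule.

```python
ATTACKER_IP = "192.168.2.5"  # stated in the dataset instructions


def bidirectional_features(flow_packets):
    """Compute per-direction features for one bidirectional flow.

    flow_packets: time-ordered list of dicts with the hypothetical keys
    'src', 'dst' and 'length'.
    """
    first = flow_packets[0]
    forward_src = first["src"]  # assumption: first packet defines "forward"
    fwd = [p["length"] for p in flow_packets if p["src"] == forward_src]
    bwd = [p["length"] for p in flow_packets if p["src"] != forward_src]

    features = {}
    for prefix, lengths in (("fwd_", fwd), ("bwd_", bwd)):
        # The same feature is stored twice, once per direction.
        features[prefix + "num_pkts"] = len(lengths)
        features[prefix + "mean_pkt_len"] = (
            sum(lengths) / len(lengths) if lengths else 0.0
        )

    # Assumed labelling rule: attack if the attacker is either endpoint.
    features["is_attack"] = int(ATTACKER_IP in (first["src"], first["dst"]))
    return features


# Hypothetical flow between the attacker and another host.
flow = [
    {"src": "192.168.2.5", "dst": "192.168.2.1", "length": 100},
    {"src": "192.168.2.1", "dst": "192.168.2.5", "length": 60},
    {"src": "192.168.2.5", "dst": "192.168.2.1", "length": 140},
]
features = bidirectional_features(flow)
print(features)
```

Because the attack pcap files also contain background normal traffic, some per-flow or per-packet labelling rule of this kind is needed before training; the rule actually used for the published CSV files should be checked against the paper and the GitHub code.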

 

Comments

Really very interesting, thank you very much.

Submitted by Randi Rizal on Sat, 10/31/2020 - 04:16

Great work!
Thank you very much for sharing.

Submitted by kahraman kostas on Wed, 01/13/2021 - 11:09

How can I download the CSV files of the MQTT dataset?

Submitted by Khizra Arooj on Wed, 01/27/2021 - 10:13

Please find the packet_features.zip on the same page, it contains CSV files.

Submitted by Muhammad Khan on Wed, 03/10/2021 - 01:41

Thank you for your interest in the dataset. The features can be found in the packet_features.zip (in the dataset files section). Also, you can find more details of the experiments in our published paper "https://link.springer.com/chapter/10.1007/978-3-030-64758-2_6" and the code is on GitHub "https://github.com/AbertayMachineLearningGroup/MQTT_ML"

Submitted by Hanan Hindy on Tue, 05/25/2021 - 19:20

Hello!

I'm working on Anomaly Detection for my Master's thesis, but with an emphasis on physical signals (more Fault Detection, less Intrusion Detection), which in this case would involve the signals from the sensors themselves. I would, therefore, ignore all data regarding packet metadata (source, origin, message size, etc.). Would you consider that this dataset has enough information on what the sensors are picking up?

Thank you!

Submitted by Carlos Pinto on Wed, 05/12/2021 - 08:16

Hello Carlos,
Thank you for your interest in the dataset. Unfortunately, MQTT-IoT-IDS2020 focuses more on the data transferred between the sensors (specifically on protocol-based attacks vs generic attacks), not on the sensor data/messages themselves. Sensor messages are randomly generated.
You can find more details of the experiments in our published paper "https://link.springer.com/chapter/10.1007/978-3-030-64758-2_6" and the code is on GitHub "https://github.com/AbertayMachineLearningGroup/MQTT_ML".
Finally, I would suggest that you reuse the simulated network architecture to generate data for your experiments.

I hope this helps!

Submitted by Hanan Hindy on Tue, 05/25/2021 - 19:25

Hello

Thanks for this data set! It is definitely needed. For bidirectional flow generation, did you use a specific tool to identify TCP/UDP flows and calculate the features or did you manually convert the raw packet data?

Submitted by James Brown on Mon, 07/12/2021 - 08:38

Hello James,
Thank you for your interest in the dataset. I manually extracted the features from the raw packets using Python, specifically the dpkt package.

The code is available on GitHub.

I hope this helps!

Submitted by Hanan Hindy on Fri, 07/16/2021 - 14:04