Collecting and analysing heterogeneous data sources from the Internet of Things (IoT) and Industrial IoT (IIoT) is essential for training and validating machine-learning-based cybersecurity applications. However, analysing those data sources remains a major challenge, as it requires reducing the high-dimensional feature space and selecting the important features and observations from the different sources.
One of the major research challenges in this field is the unavailability of a comprehensive network-based data set that reflects modern network traffic scenarios, a wide variety of low-footprint intrusions, and in-depth structured information about the network traffic. The KDD98, KDDCUP99 and NSLKDD benchmark data sets, long used to evaluate network intrusion detection systems, were generated over a decade ago. Numerous recent studies have shown that, in the current threat environment, these data sets do not comprehensively reflect modern network traffic and low-footprint attacks.
The Boğaziçi University DDoS dataset (BOUN DDoS) was generated at Boğaziçi University with the Hping3 traffic generator by flooding TCP SYN and UDP packets. The dataset includes attack-free user traffic as well as attack traffic and is suitable for evaluating network-based DDoS detection methods. Attacks target a single victim server connected to the backbone router of the campus, and attack packets carry randomly generated spoofed source IP addresses. The trace was recorded on the backbone and includes over 4000 active hosts.
The dataset includes two different attack scenarios. In both scenarios, randomly generated spoofed IP addresses are used in a flooding manner, and TCP flood attacks target destination port 80. Each dataset lasts 8 minutes and, within it, an 80-second waiting period is followed by a 20-second attack period. Different packet rates are provided so that researchers can evaluate their detection methods at different attack intensities.
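If one assumes the waiting and attack periods alternate strictly over the 8-minute trace (80 s idle, then 20 s attack, repeated), a packet's timestamp alone indicates which period it falls in. A minimal sketch under that assumption (the function name and the strict-alternation assumption are ours, not part of the dataset documentation):

```python
def in_attack_window(t, idle=80.0, attack=20.0):
    """Return True if timestamp t (in seconds from trace start) falls in an
    attack period, assuming strictly alternating idle/attack periods that
    begin with an idle period."""
    cycle = idle + attack  # 100-second cycle
    return (t % cycle) >= idle

print(in_attack_window(79.5))   # → False (still in the waiting period)
print(in_attack_window(85.0))   # → True  (attack period)
print(in_attack_window(100.0))  # → False (next waiting period begins)
```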
The TCP SYN flood and UDP flood datasets include attack rates of 1000, 1500, 2000 and 2500 packets/second. The attack topology is given in Figure 1.
Fig. 1. BOUN DDoS attack topology.
Attack packets can be distinguished from attack-free packets by the destination IP address of the packets: the victim IP address is 10.50.199.86.
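Following that rule, labelling the trace reduces to a comparison against the victim address. A minimal sketch (the sample rows and header below are hypothetical, built from the column layout in Section II):

```python
import csv
import io

# Hypothetical sample rows using the column layout from Section II
# (only the first five columns are shown here).
sample = """Time,Frame Number,Frame length,Source IP,Destination IP
0.000001,1,60,192.168.1.10,10.50.199.86
0.000002,2,1514,10.50.3.7,10.50.12.9
"""

VICTIM_IP = "10.50.199.86"  # victim server address from the dataset description

def label_rows(csv_text):
    """Label each packet 'attack' or 'attack-free' by its destination IP."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["label"] = "attack" if row["Destination IP"] == VICTIM_IP else "attack-free"
    return rows

labeled = label_rows(sample)
print([r["label"] for r in labeled])  # → ['attack', 'attack-free']
```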
II. DATASET STRUCTURE
Datasets are in comma-separated value (CSV) file format and have the following columns:
Time: Time values start from zero and have a resolution of 0.000001 seconds. Time values are expressed in seconds.
Frame Number: Frame number is simply the incremental count of packets in the dataset.
Frame length: Frame length is the length of that packet in bytes.
Source IP: Source IP address of the packet.
Destination IP: Destination IP address of the packet.
Source Port: Source TCP port of the packet. If it is not a TCP packet, this field is empty.
Destination Port: Destination TCP port of the packet. If it is not a TCP packet, this field is empty.
SYN: This value is “Set” if the packet is a TCP packet and its SYN flag is equal to one, and “Not Set” if the packet is a TCP packet and its SYN flag is equal to zero. If the packet is not a TCP packet, this field is empty.
ACK: This value is “Set” if the packet is a TCP packet and its ACK flag is equal to one, and “Not Set” if the packet is a TCP packet and its ACK flag is equal to zero. If the packet is not a TCP packet, this field is empty.
RST: This value is “Set” if the packet is a TCP packet and its RST flag is equal to one, and “Not Set” if the packet is a TCP packet and its RST flag is equal to zero. If the packet is not a TCP packet, this field is empty.
TTL: Time-to-live value of the packet.
TCP Protocol: This value is TCP or UDP if the packet belongs to one of those transport-layer IP protocols; otherwise it holds the name of the packet's protocol.
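The SYN/ACK/RST columns encode a three-valued state (“Set”, “Not Set”, or empty for non-TCP packets). A small parsing helper (the function name and the True/False/None mapping are our own convention, not part of the dataset):

```python
def parse_flag(value):
    """Map a SYN/ACK/RST column value onto a three-valued boolean:
    'Set' -> True, 'Not Set' -> False, empty (non-TCP packet) -> None."""
    if value == "Set":
        return True
    if value == "Not Set":
        return False
    return None

print(parse_flag("Set"), parse_flag("Not Set"), parse_flag(""))  # → True False None
```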
Typically, a paper mill comprises three main stations: the paper machine, the winder station, and the wrapping station. The paper machine produces paper with a particular grammage in gsm (grams per square metre). The typical grammage classes in our paper mill are 48 gsm, 50 gsm, 58 gsm, 60 gsm, 68 gsm and 70 gsm. The winder station takes a paper spool about 6 m wide as its input and converts it into customized paper rolls of a particular diameter and width.
This dataset is related to the paper “Quantification of feature importance in automatic classification of power quality distortions” (IEEE International Conference on Harmonics and Quality of Power, March 2020). It includes the features extracted from synthetic signals with power quality distortions obtained from a public model (doi: 10.1109/ICHQP.2018.8378902).
This database contains the results of an experiment where healthy subjects played 5 trials of a rehabilitation-based VR game to experience either difficulty variations or presence variations.
Collected results are demographic information, emotional responses reported after each trial, and electrophysiological signals recorded during all 5 trials.
The Multi-modal Exercises Dataset is a multi-sensor, multi-modal dataset implemented to benchmark Human Activity Recognition (HAR) and multi-modal fusion algorithms. Collection of this dataset was inspired by the need to recognise and evaluate the quality of exercise performance in order to support patients with Musculoskeletal Disorders (MSD). The MEx Dataset contains data from 25 people recorded with four sensors: 2 accelerometers, a pressure mat and a depth camera.
The MEx Multi-modal Exercise dataset contains data of 7 different physiotherapy exercises, performed by 30 subjects recorded with 2 accelerometers, a pressure mat and a depth camera.
The dataset can be used for exercise recognition, exercise quality assessment and exercise counting, by developing algorithms for pre-processing, feature extraction, multi-modal sensor fusion, segmentation and classification.
Data collection method
Each subject was given a sheet of 7 exercises with instructions at the beginning of the session. At the beginning of each exercise the researcher demonstrated the exercise to the subject, then the subject performed the exercise for a maximum of 60 seconds while being recorded with the four sensors. During the recording, the researcher did not give any advice, keep count, or keep time to enforce a rhythm.
Orbbec Astra Depth Camera
- sampling frequency – 15Hz
- frame size – 240x320
Sensing Tex Pressure Mat
- sampling frequency – 15Hz
- frame size – 32x16
Axivity AX3 3-Axis Logging Accelerometer
- sampling frequency – 100Hz
- range – 8g
All the exercises were performed lying down on the mat while the subject wore two accelerometers on the wrist and the thigh. The depth camera was placed above the subject facing downwards, recording an aerial view. The top of the depth camera frame was aligned with the top of the pressure mat frame and the subject’s shoulders so that the face is not included in the depth camera video.
The MEx folder has four folders, one for each sensor. Inside each sensor folder,
30 folders can be found, one for each subject. In each subject folder, 8 files can be found, one per exercise, with 2 files for exercise 4 as it is performed on two sides. (Subject 22 has only 7 files, as they performed exercise 4 on one side only.) Each line in a data file corresponds to one timestamped sensor reading.
The 4 columns in the act and acw files are organized as follows:
1 – timestamp
2 – x value
3 – y value
4 – z value
Min value = -8
Max value = +8
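Given that four-column layout, one act/acw line can be parsed with plain string splitting. A minimal sketch (the sample line below is illustrative; the dataset's actual timestamp format is not specified here):

```python
def parse_accel_line(line):
    """Split one act/acw line into (timestamp, (x, y, z)).
    Acceleration values lie in the sensor's -8..+8 g range."""
    timestamp, x, y, z = line.strip().split(",")
    return timestamp, tuple(float(v) for v in (x, y, z))

ts, xyz = parse_accel_line("0.01,0.012,-0.981,0.034")
print(xyz)  # → (0.012, -0.981, 0.034)
```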
The 513 columns in the pm file are organized as follows:
1 - timestamp
2-513 – pressure mat data frame (32x16)
Min value – 0
Max value – 1
The 193 columns in the dc file are organized as follows:
1 - timestamp
2-193 – depth camera data frame (12x16)
The dc data frame is scaled down from 240x320 to 12x16 using the OpenCV resize algorithm.
Min value – 0
Max value – 1
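Since each pm line carries 512 values (32x16) and each dc line 192 values (12x16) after the timestamp, the flat row must be reshaped into a frame before use. A small row-major reshape helper (the function is ours, not part of the dataset tooling):

```python
def row_to_frame(values, rows, cols):
    """Reshape the flat list of sensor values (columns 2..N of a pm or dc
    line) into a rows x cols frame, row-major."""
    assert len(values) == rows * cols, "unexpected number of values"
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

# pm lines carry 512 values -> 32x16; dc lines carry 192 values -> 12x16
flat = list(range(192))          # stand-in for one dc line without its timestamp
frame = row_to_frame(flat, 12, 16)
print(len(frame), len(frame[0]))  # → 12 16
```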
Even though intelligent systems such as Siri or Google Assistant are enjoyable (and useful) dialog partners, users can only access predefined functionality. Enabling end-users to extend the functionality of intelligent systems will be the next big thing. To promote research in this area we carried out an empirical study on how laypersons teach robots new functions by means of natural language instructions. The result is a labeled corpus consisting of 3168 submissions given by 870 subjects.
The corpus consists of three datasets:
- The raw dataset of submissions (without labels): raw_dataset.csv
- The labeled dataset: labeled_dataset.csv
- Personal data of the participants as provided by Prolific (caution: the information is incomplete, since registered members provide it voluntarily): personal_infomation_prolific.csv