To download the dataset, click the link provided. To unzip the file, double-click the zipped folder to open it, then drag or copy the item from the zipped folder to a new location.


We propose a driver pattern dataset consisting of 51 features extracted from the CAN (Controller Area Network) of a Hyundai YF Sonata while four drivers drove on city roads in Seoul, Republic of Korea. On the premise that different driving patterns are implicitly present in CAN data, we collected CAN diagnosis data from four drivers to support research on driver identification, driver profiling, and abnormal driving behavior detection. The four drivers are named A, B, C, and D.



The dataset contains 51 features extracted from CAN data over numerous trips performed by four drivers. The four drivers drove along city roads in Seoul, Republic of Korea. The 51 recorded features can be used for driver identification, driver profiling, abnormal driving pattern identification, and related tasks. Please check the abstract for a more detailed description.

CSV Files

Directories A, B, C, and D contain .csv files of CAN data. Each .csv file represents one trip.


The names of the 51 features are listed in the features.pkl file. Please check the file for detailed information.
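
The directory layout described above could be read with a short Python sketch like the one below. It rests on assumptions the description does not confirm: that features.pkl unpickles to a list-like object of feature names, that the trip files sit directly under each driver directory, and that every .csv file starts with a header row. The names `load_feature_names`, `load_trips`, and `root` are hypothetical.

```python
import csv
import pickle
from pathlib import Path

DRIVERS = ("A", "B", "C", "D")  # driver directory names given in the description


def load_feature_names(root):
    """Return the feature names stored in features.pkl as a list."""
    # Assumption: the pickle holds a list-like object of names.
    # Only unpickle files obtained from a source you trust.
    with open(Path(root) / "features.pkl", "rb") as fh:
        return list(pickle.load(fh))


def load_trips(root):
    """Return {driver: {trip file name: list of row dicts}}."""
    trips = {}
    for driver in DRIVERS:
        trips[driver] = {}
        for path in sorted((Path(root) / driver).glob("*.csv")):
            with open(path, newline="") as fh:
                # Assumption: each trip file begins with a header row.
                trips[driver][path.name] = list(csv.DictReader(fh))
    return trips
```

Values are returned as strings; converting them to numbers is left to the caller because the column types are not stated here.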


Park, Kyung Ho, and Huy Kang Kim. "This Car is Mine!: Automobile Theft Countermeasure Leveraging Driver Identification with Generative Adversarial Networks." arXiv preprint arXiv:1911.09870 (2019).

Park, Kyung Ho, and Huy Kang Kim. "This Car is Mine!: Automobile Theft Countermeasure Leveraging Driver Identification with Generative Adversarial Networks." ESCAR Asia (2019).



This dataset represents the main distinct learning behaviors that may be found in any group of learners in e-learning/educational systems. It covers 20 learners across 17 open educational resources (OERs).


The dataset consists of two files:

1. OER Tracked Behavior.CSV

2. Course Tracked Behavior.CSV


This data set includes tweets related to the US November 2020 election that contain #USAelection or at least one of the following keywords about four parties:

Keywords about Democratic Party:
@DNC OR @TheDemocrats OR Biden OR @JoeBiden OR "Our best days still lie ahead" OR "No Malarkey!"

Keywords about Green Party:
@GreenPartyUS OR @TheGreenParty OR “Howie Hawkins” OR @HowieHawkins OR “Angela Walker” OR @AngelaNWalker

Keywords about Libertarian Party:
@LPNational OR “Jo Jorgersen” OR @Jorgensen4POTUS OR “Spike Cohen” OR @RealSpikeCohen


Currently, the dataset contains 3.5 million tweets, each with 6 attributes, sent from 1 July 2020 until 12 August 2020.

The data file contains comma-separated values (CSV) and is compressed with WinRAR for easier upload and download. It contains the following information (6 columns) for each tweet:

Created-At: Exact creation time of the tweet
From-User-Id: User ID of the sender
To-User-Id: User ID of the recipient, if the tweet was sent to a user
Language: Language of the tweet, coded in ISO 639-1. 91.7% of tweets are en (English), 3.9% are und (Unidentified), and 2.15% are es (Spanish).
Retweet-Count: Number of retweets
Id: ID of the tweet, unique across all tweets
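
As an illustration of working with these columns, the sketch below recomputes language shares like those quoted for the Language column. It assumes the extracted CSV has a header row using exactly the column names listed above, which the description does not state. The sample rows are made up for the example and are not taken from the dataset, and `language_shares` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def language_shares(csv_file):
    """Return {language code: percentage of tweets} for an open CSV file."""
    # Assumption: the header row names the column "Language".
    counts = Counter(row["Language"] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: 100.0 * n / total for lang, n in counts.items()}


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "Created-At,From-User-Id,To-User-Id,Language,Retweet-Count,Id\n"
    "2020-07-01 00:00:01,11,,en,3,1001\n"
    "2020-07-01 00:00:02,12,11,en,0,1002\n"
    "2020-07-01 00:00:03,13,,es,1,1003\n"
    "2020-07-01 00:00:04,14,,en,0,1004\n"
)
shares = language_shares(sample)
```

With the four sample rows, `shares` maps `en` to 75.0 and `es` to 25.0.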

This data can be used to predict election results using sentiment analysis and predictive analytics. Text mining techniques such as topic modelling can also be used to understand the main issues that concern Twitter users regarding the US election.


This dataset provides 2D geometrically shaped constellations that are simultaneously robust to both residual phase noise (RPN) and AWGN (named xx(x)M-RPN, where xx(x) is the receiver type) for 8- to 256-ary PCAWGN reception. The presented formats are optimised at the indicative bit-wise achievable information rate (AIR) threshold of 0.9m bit/symbol, where m is the number of bits per constellation point. Additionally, we added AWGN-only constellations (xx(x)M-AWGN) to serve as a reference.


Each modulation order is placed in a separate folder, in which every text file lists the in-phase and quadrature coordinates of each symbol in the first and second columns, respectively. The bit mapping for each symbol is the natural binary mapping of its line number, i.e., 000, 001, 010, 011, 100, etc.
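
The file layout and bit mapping described above can be sketched in Python as follows. The sketch assumes the two columns are separated by whitespace or commas and that line numbering starts at zero, so the first line carries the all-zero label, as the 000, 001, ... sequence suggests. The function name `read_constellation` and the sample coordinates are hypothetical.

```python
def read_constellation(lines):
    """Map natural-binary bit labels to complex points (I + jQ)."""
    points = []
    for line in lines:
        # Assumption: columns are separated by whitespace or commas.
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        points.append(complex(float(fields[0]), float(fields[1])))
    # Bits per symbol: 3 for 8 points, 4 for 16 points, and so on.
    bits = (len(points) - 1).bit_length()
    # Assumption: the first line is index 0 and carries the all-zero label.
    return {format(i, f"0{bits}b"): p for i, p in enumerate(points)}


# Made-up 4-point example, not one of the published constellations.
example = read_constellation(["1.0 1.0", "-1.0 1.0", "-1.0 -1.0", "1.0 -1.0"])
```

Here `example["10"]` is the third line of the input, that is, the point -1 - 1j.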


The large p, small n problem is a challenging problem in big data analytics, and there is no de facto standard method for it. In this study, we propose a tensor decomposition (TD) based unsupervised feature extraction (FE) formalism applied to multiomics datasets, where the number of features exceeds 100,000 while the number of instances is as small as about 100.


We built an original dataset of thermal videos and images that simulate illegal movements around the border and in protected areas, designed for training machine learning and deep learning models. The videos were recorded in areas around the forest, at night, in different weather conditions (clear weather, rain, and fog), with people in different body positions (upright, hunched) and moving at different speeds (regular walking, running) at different ranges from the camera.



About 20 minutes of recorded material from the clear weather scenario, 13 minutes from the fog scenario, and about 15 minutes from rainy weather were processed. The longer videos were cut into sequences and from these sequences individual frames were extracted, resulting in 11,900 images for the clear weather, 4,905 images for the fog, and 7,030 images for the rainy weather scenarios.

A total of 6,111 frames were manually annotated so that they could be used to train a supervised model for person detection. The frames were selected to cover the different weather conditions, so the set contains 2,663 frames shot in clear weather, 1,135 frames shot in fog, and 2,313 frames shot in rain.

The annotations were made using the open-source Yolo BBox Annotation Tool, which can simultaneously store annotations in the three most popular machine learning annotation formats (YOLO, VOC, and MS COCO), so all three formats are available. Each image annotation consists of the centroid position of the bounding box around each object of interest, the size of the bounding box in terms of width and height, and the corresponding class label (Human or Dog).
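
For readers who need corner coordinates rather than the centroid-plus-size description above, the sketch below converts one YOLO-style label line. It relies on the standard YOLO text convention of `class cx cy w h`, with all four box values normalised by the image size. The image size in the example is a placeholder, not the resolution of the thermal camera, and the mapping from class index to Human or Dog is not stated here, so the index is returned unchanged. `yolo_to_corners` is a hypothetical name.

```python
def yolo_to_corners(line, img_w, img_h):
    """Convert 'class cx cy w h' (normalised) to pixel corner coordinates."""
    cls, cx, cy, w, h = line.split()
    # Scale normalised values back to pixels.
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    # Return (x_min, y_min, x_max, y_max) around the centroid.
    return int(cls), (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Placeholder label and image size, for illustration only.
cls_index, box = yolo_to_corners("0 0.5 0.5 0.25 0.5", img_w=640, img_h=480)
```

In this example the box is centred in the image and `box` evaluates to `(240.0, 120.0, 400.0, 360.0)`.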



Coventry-2018 is a human activity recognition dataset captured by three Panasonic® Grid-EYE (AMG8833) infrared sensors in March 2018. Each Grid-EYE sensor represents a scene with a 60° field of view as an 8 × 8 array called a frame. The data streams are synchronized to 10 frames per second and saved as *.csv recordings using the LabVIEW® software. The dataset covers two layouts of different geometric sizes: 1) a small layout; and 2) a large layout.
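
A minimal sketch of turning one recorded row into an 8 × 8 frame and attaching timestamps at 10 frames per second is given below. How the *.csv recordings are actually laid out is not described here, so the sketch assumes that one row holds the 64 pixel values of a single sensor frame in row-major order. `row_to_frame` and `frame_times` are hypothetical helper names.

```python
def row_to_frame(values, side=8):
    """Reshape a flat list of side*side pixel values into a 2D frame."""
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(values)}")
    # Assumption: pixels are stored in row-major order.
    return [values[r * side:(r + 1) * side] for r in range(side)]


def frame_times(n_frames, fps=10.0):
    """Return the time in seconds of each frame at a fixed frame rate."""
    return [i / fps for i in range(n_frames)]


frame = row_to_frame(list(range(64)))  # placeholder pixel values
times = frame_times(3)
```

With the placeholder input, `frame[1][0]` is 8 and `times` is `[0.0, 0.1, 0.2]`.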


Presented here is a dataset used for our SCADA cybersecurity research. The dataset was built using our SCADA system testbed described in our paper below [*]. The purpose of our testbed was to emulate real-world industrial systems closely. It allowed us to carry out realistic cyber-attacks.



The provided dataset is cleansed, pre-processed, and ready to use. Users may modify it as they wish, but please cite the dataset as below.

M. A. Teixeira, M. Zolanvari, R. Jain, "WUSTL-IIOT-2018 Dataset for ICS (SCADA) Cybersecurity Research," 2018. [Online]. Available:


The Message Queuing Telemetry Transport (MQTT) protocol is one of the most widely used standards in Internet of Things (IoT) machine-to-machine communication. The growing number of available IoT devices and protocols reinforces the need for new and robust Intrusion Detection Systems (IDS). However, building an IoT IDS requires datasets with which to process, train, and evaluate the underlying models. The dataset presented in this paper is the first to simulate an MQTT-based network. The dataset is generated using a simulated MQTT network architecture.


The dataset consists of 5 pcap files, namely normal.pcap, sparta.pcap, scan_A.pcap, mqtt_bruteforce.pcap, and scan_sU.pcap. Each file is a recording of one scenario: normal operation, Sparta SSH brute-force, aggressive scan, MQTT brute-force, and UDP scan, respectively. The attack pcap files also contain normal background operations. The attacker IP address is “”. Basic packet features are extracted from the pcap files into CSV files with the same names as the pcap files. The features include flags, length, MQTT message parameters, etc. Unidirectional and bidirectional flow features are then extracted. Note that for the bidirectional flows, some features (marked with *) have two values: one for the forward flow and one for the backward flow. The two values are recorded and distinguished by the prefix “fwd_” for forward and “bwd_” for backward.
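
The prefix convention for bidirectional features can be handled with a small helper like the sketch below, which separates one flow record into forward, backward, and shared values. Only the “fwd_” and “bwd_” prefixes come from the description above; the feature names in the example record are placeholders and may not match the actual CSV columns, and `split_directions` is a hypothetical name.

```python
def split_directions(record):
    """Split a flow record into (forward, backward, shared) dicts."""
    forward, backward, shared = {}, {}, {}
    for name, value in record.items():
        if name.startswith("fwd_"):
            forward[name[len("fwd_"):]] = value
        elif name.startswith("bwd_"):
            backward[name[len("bwd_"):]] = value
        else:
            shared[name] = value  # feature with a single value per flow
    return forward, backward, shared


# Placeholder feature names and values, for illustration only.
record = {"proto": "tcp", "fwd_num_pkts": 12, "bwd_num_pkts": 9}
fwd, bwd, shared = split_directions(record)
```

Stripping the prefix gives both directions the same keys, which makes it straightforward to compare forward and backward values of one feature.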