This is a communication dataset for the simulation of WSNs in TinyOS.

There are two groups of files, used separately for simulating topology and noise in WSN communication. Both are designed for the TinyOS platform.


This sheet contains the answers from our European Cyber Security MSc Education Survey. The data shows which knowledge units various degree programmes in Europe cover, and to what extent. We drew conclusions from this data in the paper "Are We Preparing Students to Build Security In? A Survey of European Cybersecurity in Higher Education Programs". The present dataset is newer and therefore extends the one our paper was based on.


The file is an Excel .xlsx file, so you can open it in Excel, LibreOffice, OpenOffice or a similar spreadsheet tool. The file has 3 sheets:

  • Universities: Contains all the raw data
  • KAs and KUs: Contains the mapping of each knowledge unit to a knowledge area
  • Explanation: Contains an explanation of the data. It also contains a few errata for our paper based on a previous version of the data.

Dataset used in the article "An Ensemble Method for Keystroke Dynamics Authentication in Free-Text Using Word Boundaries". For each user and free-text sample of the companion dataset LSIA, it contains a CSV file with the list of words in the sample that survived the filters described in the article, together with CSV files holding the training instances for each word. The source data comes from a dataset used in previous studies by the authors. The language of the free-text samples is Spanish.


66% of Prestashop websites are at high risk from cyber criminals.

Common Hacks in Prestashop


The Dada dataset is associated with the paper “Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?”. The goal of this dataset is to provide a new, updated dataset of goodware/malware applications that other researchers can use for experiments, for example with detection or classification algorithms. The dataset contains the applications’ hashes and some of their characteristics.


Quick-start for using the output datasets for your own experiment

If you just want to use the mixed datasets (goodware/malware) for your experiments, run:

python3 api_key_androzoo ./

where api_key_androzoo is your API key file, provided by the team administering AndroZoo. This script downloads applications from AndroZoo according to the result of debiasing Drebin/VirusShare mixed with Naze. This result is cached for you.
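Under the hood, AndroZoo downloads are plain HTTPS requests keyed by SHA-256. A hedged sketch (the helper names below are ours, not the repository's; only the public AndroZoo download endpoint is assumed):

```python
import urllib.parse
import urllib.request


def build_download_url(api_key: str, sha256: str) -> str:
    """Build the public AndroZoo download URL for one APK."""
    query = urllib.parse.urlencode({"apikey": api_key, "sha256": sha256})
    return "https://androzoo.uni.lu/api/download?" + query


def download_apk(api_key: str, sha256: str, out_path: str) -> None:
    """Fetch a single APK by hash and save it to out_path."""
    urllib.request.urlretrieve(build_download_url(api_key, sha256), out_path)
```

This is only an illustration of the download mechanism; the provided script handles the full dataset lists and caching for you.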

Two datasets are provided:

  • DN: a debiased version of Drebin mixed with goodware from Androzoo (called Naze)
  • VSN: a debiased version of VirusShare mixed with goodware from Androzoo (called Naze)

├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training

More information about how these datasets have been constructed is given in the paper and this README.


We provide each dataset as a list of hashes in a file, plus some additional information such as, for mixed datasets, whether an APK is malware. As the primary intent of this work is to debias datasets, we do not need (nor provide) the APKs themselves. Nevertheless, one can recover all these datasets’ content with helper scripts, as explained at the end of this document.

All dataset information is located in the files of the datasets/ folder.

File structure

  • file.sha256.txt: the hashes of the applications in the dataset
  • file.characteristics.csv: the characteristics for each SHA-256 hash
  • file.goodmal.csv: the class information (goodware=0 or malware=1) for mixed datasets. This file is omitted when the dataset consists entirely of goodware or entirely of malware.

The header of the characteristics.csv file is:

sha256,date,year,APK size,Personal information,Leak information,Phone integrity,Denial of service,Intrusion
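A minimal sketch of reading these files in Python. The helper names are ours, and the assumption that file.goodmal.csv holds headerless `sha256,label` rows is ours as well, inferred from the description above:

```python
import csv


def load_hashes(path: str) -> list[str]:
    """Read a file.sha256.txt: one hash per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def load_goodmal(path: str) -> dict[str, int]:
    """Read a file.goodmal.csv, assumed to hold sha256,label rows
    (goodware=0, malware=1) with no header."""
    with open(path, newline="") as f:
        return {row[0]: int(row[1]) for row in csv.reader(f)}
```

With these, the hash list and labels can be joined on the SHA-256 key for any mixed dataset.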

Malware datasets:

The datasets of the paper correspond to the files:

  • drebin: drebin
  • AMD: amd
  • VS 2015: virusshare-2015
  • VS 2016: virusshare-2016
  • VS 2017: virusshare-2017
  • VS 2018: virusshare-2018

Androzoo extracts:

The datasets of the paper correspond to the files:

  • AZ19_100k: androzoo-100k
  • AZ19_100k 2015: androzoo-100k-2015
  • AZ19_100k 2016: androzoo-100k-2016
  • AZ19_100k 2017: androzoo-100k-2017
  • AZ19_100k 2018: androzoo-100k-2018
  • AZ20 10k: androzoo-10k-2020
  • AZ20 20k: androzoo-20k-2020
  • AZ20 30k: androzoo-30k-2020

Note that a few applications have been removed from these extracts because analysis tools such as apktool fail to process them.

Goodware datasets:

The datasets of the paper correspond to the files:

  • NAZE-18-G: goodware-2018
  • NAZE_Debiased-18-G: debias-goodware-2018-to-30k-0025

Debiased malware datasets:

The datasets of the paper correspond to the files:

  • Drebin_Debiased: debias-drebin-to-30k-0025
  • VS_Debiased-15-18: debias-vs15-18-to-30k-02

For delta in 0.{0025,005,01,02,04}:

  • VS_Debiased-15: debias-vs2015-to-az100k-2015-delta
  • VS_Debiased-16: debias-vs2016-to-az100k-2016-delta
  • VS_Debiased-17: debias-vs2017-to-az100k-2017-delta
  • VS_Debiased-18: debias-vs2018-to-az100k-2018-delta

Mixed datasets:

These datasets contain the additional file file.goodmal.csv.


The datasets of the paper correspond to the files:

  • Dmix: mix-drebin-two-third-goodware

Drebin_Debiased + NAZE_Debiased-18-G

These datasets have been built to be directly usable by machine learning algorithms. Instructions for downloading them are at the end of this document; downloading all the APKs of these datasets is not required to execute the debiasing algorithms.

The datasets of the paper correspond to the files:

Training sets:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win
  • DN50: drebin_debiased-naze_debiased-training

Test sets:

  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win
  • DN5: drebin_debiased-naze_debiased-test-5p
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
  • DN10: drebin_debiased-naze_debiased-test-10.0p

Goodware/Malware information:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win.goodmal.csv
  • DN50: drebin_debiased-naze_debiased-training.goodmal.csv
  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win.goodmal.csv
  • DN5: drebin_debiased-naze_debiased-test-5p.goodmal.csv
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win.goodmal.csv
  • DN10: drebin_debiased-naze_debiased-test-10.0p.goodmal.csv

VS_Debiased-15-18 + NAZE_Debiased-18-G

Training sets:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win
  • VSN50: vs15-18_debiased-naze_debiased-2017-training

Test sets:

  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p

Goodware/Malware information:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win.goodmal.csv
  • VSN50: vs15-18_debiased-naze_debiased-2017-training.goodmal.csv
  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win.goodmal.csv
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p.goodmal.csv
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win.goodmal.csv
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p.goodmal.csv

Usage of debiasing and evaluation scripts

Scripts are provided to replay the dataset debiasing process (or to try a new arrangement of datasets). Debiasing first requires generating the possible classes; a class is defined by an observed combination of characteristics.


First, install the required packages from requirements.txt:

pip3 install -r requirements.txt

Generation of classes

Before running the debiasing algorithm, subfolders must be generated containing new files that describe the classes (combinations of characteristics) and the APKs belonging to each class. These subfolders are generated in “input_datasets”.

To generate these subfolders, use the following command:

python3 config.datasets.ini datasets

For example, for drebin, the script generates:


  • characteristics.csv: a copy of the original file located in the datasets/ folder
  • classcount.csv: the number of APKs matching each combination of characteristics (each class). For example, for Drebin, the class “1,1,1,0,0,0” contains 543 applications.
  • combination_hashes.json: a dictionary that associates each combination of characteristics (each class) with the list of SHA-256 hashes of its APKs

Precomputing this information makes the later debiasing of datasets more efficient.
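The class-generation step above can be sketched as follows. The function and argument names are assumptions; the core idea is simply that a class is the tuple of characteristic values, under which APK hashes are grouped and counted:

```python
from collections import defaultdict


def build_classes(rows: list[dict], char_cols: list[str]):
    """Group APKs into classes.

    rows: dicts holding a 'sha256' key plus characteristic columns.
    char_cols: which columns define a class (an assumption; the paper
    defines classes over the observed characteristic combinations).
    Returns (classcount, combination_hashes), mirroring the two
    generated files described above.
    """
    combos = defaultdict(list)
    for row in rows:
        key = ",".join(str(row[c]) for c in char_cols)
        combos[key].append(row["sha256"])
    classcount = {k: len(v) for k, v in combos.items()}
    return classcount, dict(combos)
```

For instance, two APKs with identical characteristic values end up in the same class, and the class key is the comma-joined value string, matching the “1,1,1,0,0,0” notation used above.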

Single dataset debiasing algorithm

To debias one base dataset (or a union of base datasets) against a target dataset, using a list of source datasets, the user should refer to each dataset by its file basename in the datasets/ folder. The command line looks like the following:

python3 config.datasets.ini \
new_dataset_name \
"Name of the new dataset" \
--base-datasets base_dataset_list \
--target-dataset target_dataset_name \
--source-datasets list \
--delta value

where delta is a value in [0,1] controlling how far the output may be from the target dataset.

Four outputs are expected in out/new_dataset_name and are similar to the outputs of the generation of classes:

  • the file new_dataset_name.characteristics.csv: the characteristics.
  • the file new_dataset_name.features_specs.classcount.csv: the number of APKs for each combination of characteristics (class).
  • the file new_dataset_name.features_specs.combination_hashes.json: a dictionary associating each class with the SHA-256 hashes of its APKs.
  • the file new_dataset_name.features_specs.dataset_class_info.json: some general information about the experiment:
    • size: size of the dataset
    • modified: true if this dataset has been generated
    • base dataset and original size
    • target dataset
    • source dataset list
    • delta
    • number of combinations (classes)
    • combination not found: the empty classes, i.e. those for which no APK representing the class could be found
    • debiasable: false if the debiasing algorithm fails
    • added: the number of APKs added from the sources to this new dataset
    • removed: the number of APKs removed from the base dataset
    • d_min final: the final d_min value of the algorithm described in the paper
    • add ratio: the ratio of newly added APKs to the size of the generated dataset, between 0 and 1
    • run time: the duration of the debiasing algorithm
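A small sketch of reading this summary file; the exact JSON key names are assumptions derived from the list above:

```python
import json


def summarize(class_info_path: str) -> dict:
    """Pull a few summary fields from a dataset_class_info.json file.
    The key names ('size', 'added', ...) are assumed to match the
    field list above."""
    with open(class_info_path) as f:
        info = json.load(f)
    return {k: info.get(k) for k in ("size", "added", "removed", "debiasable")}
```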

For example, for the Drebin dataset as input, with androzoo-30k-2020 as target dataset, with amd and virusshare-201{5,6,7,8} as source datasets, and for a delta of 0.04, the user should launch:

python3 config.datasets.ini \
debias-drebin-to-30k-04-replay \
"Replay (0.04) Debiased Drebin --> AndroZoo 30k (2020)" \
--base-datasets drebin \
--target-dataset androzoo-30k-2020 \
--source-datasets amd virusshare-201{5,6,7,8} \
--delta "0.04"

We call this experiment “Replay” because the user replays the debiasing algorithm and should obtain similar results as the already provided dataset datasets/debias-drebin-to-30k-04.

Replaying this experiment generates the following files in the folder out/debias-drebin-to-30k-04-replay/:


In particular, in the class_info file, we note that:

  • 886 apps have been added
  • 2421 apps have been removed
  • the add ratio is 23.5%
  • the final dataset size is 3769

Even though the algorithm will generate a different dataset each time, the number of elements per class is the same in every re-run with the same base, target and source datasets, and delta. To verify this, the user can check the difference of the classcount.csv files between the original and the replay:

diff <(sort input_datasets/debias-drebin-to-30k-04/debias-drebin-to-30k-04.features_specs.classcount.csv) \
<(sort out/debias-drebin-to-30k-04-replay/debias-drebin-to-30k-04-replay.features_specs.classcount.csv)

If the debiasing algorithm fails, several solutions can be tested:

  • increase the delta value to let the output be farther from the target
  • provide more samples in the source datasets (probably some classes do not have enough applications)

Comparing datasets with a population

When new datasets are generated, or for the input datasets themselves, the user may want to evaluate the distance between these datasets and an extract of the population. In particular, we provide a script that evaluates the Chi2 statistic and the p-value, using the following command:

python3 config.datasets.ini \
--population population_dataset \
--datasets list of datasets to evaluate \
--append-population-name \
--filename "filename"

The parameters are:

  • population: the name of the dataset to use as an extract of the considered population
  • datasets: a list of dataset names, which can be located in either the datasets/ or out/ folders
  • append-population-name: adds the name of the population to the output file name
  • filename: the name of the output file

The outputs are:

  • out/count_analysis_output/filename_population_dataset.xlsx: a table containing the comparison of the considered datasets (Chi2, p-value, added/removed app count, etc.)
  • out/count_analysis_output/filename_population_dataset.tex: a LaTeX table containing the max delta and the sizes of the datasets.

These outputs, in particular the LaTeX one, can be customized to your needs.
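For intuition, a Pearson chi-squared statistic over class counts could be computed as below. This is a simplified sketch, not the authors' script: it scales the population counts to the dataset size, and it skips classes absent from the population (handling those is left out here):

```python
def chi2_statistic(dataset_counts: dict, population_counts: dict) -> float:
    """Pearson chi-squared statistic of a dataset's class-count
    distribution against a population extract.

    Both arguments map class keys to APK counts. Expected counts are
    the population proportions scaled to the dataset size.
    """
    n = sum(dataset_counts.values())
    pop_total = sum(population_counts.values())
    stat = 0.0
    for cls, pop_count in population_counts.items():
        expected = n * pop_count / pop_total
        observed = dataset_counts.get(cls, 0)
        stat += (observed - expected) ** 2 / expected
    return stat
```

A dataset distributed exactly like the population yields a statistic of 0; larger values mean larger distance, which is why identical replays produce identical Chi2 values.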

For example, for comparing the following three datasets with the extract of AndroZoo of size 30k extracted in 2020 (androzoo-30k-2020):

  • drebin: the original Drebin dataset
  • debias-drebin-to-30k-04: the debiased dataset already computed and dropped in the datasets/ folder
  • debias-drebin-to-30k-04-replay: the debiased dataset just generated by following this README

The user should use the following command:

python3 config.datasets.ini \
--population androzoo-30k-2020 \
--datasets drebin debias-drebin-to-30k-04{,-replay} \
--append-population-name --filename "drebin_debias_replay"

As shown in the output of the script, debias-drebin-to-30k-04 and the replay (debias-drebin-to-30k-04-replay) have the same Chi2 value, which is expected. The file out/count_analysis_output/drebin_debias_replay_androzoo-30k-2020.xlsx contains a table with information about Drebin, the provided debiased dataset, and the newly generated one:

Count analysis result

Mixed dataset debiasing algorithm

For producing mixed datasets, we provide a script that takes two datasets as input: one should contain the malware, the other the goodware.

The command is the following:

python3 config.datasets.ini \
"id_name_of_the_generated_dataset" "Full name of the generated dataset" \
id_debiased_malware_dataset \
id_debiased_goodware_dataset \
year-time-barrier_training-test \
--date-fix sha256.dex_date.vt_date.txt

with the parameters:

  • id_debiased_malware_dataset: the id of the malware dataset, loaded from the datasets/ and out/ folders.
  • id_debiased_goodware_dataset: the id of the goodware dataset, loaded from the datasets/ and out/ folders.
  • year-time-barrier_training-test: an integer representing the year used to split the datasets into the training part and the test part.
  • date-fix sha256.dex_date.vt_date.txt: a helper file that the user should provide to help identify the date of broken APKs. Indeed, some APKs have a date of 0 when the date is extracted from the APK archive (zip construction date). In that case, the script searches the helper file for an alternative date.
  • (optional) percent: specify the percent of malware applications for the output test dataset (the default is 5%)
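The year barrier works as a simple temporal split: the training part keeps samples up to and including the barrier year, and the test part keeps strictly later years (as the example outputs later in this README suggest). A minimal sketch, with assumed data shapes:

```python
def split_by_year(samples: list[tuple[str, int]], year_barrier: int):
    """Split (sha256, year) tuples around a year barrier.

    Training keeps samples up to and including the barrier year;
    the test part keeps strictly later years. This mirrors the
    time-based split described above, not the authors' exact code.
    """
    training = [s for s in samples if s[1] <= year_barrier]
    test = [s for s in samples if s[1] > year_barrier]
    return training, test
```

With a barrier of 2013, a 2013 sample lands in the training set and a 2014 sample in the test set, matching the example tables shown further below.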

For example, to mix the “debiased Drebin” dataset just created above (drebin_debiased-naze_debiased-replay) with the “debiased NAZE” dataset, using 2013 as the barrier delimiting the training set from the test set, the user should run:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

To leave the C2 condition out, add the “no-balance-time-window” option:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt \
--no-balance-time-window

For specifying 10% of malware, add the “percent” option followed by 10:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--percent 10 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

This script outputs two folders, one for the training set and one for the test set. The content of these folders is similar to that of single-dataset debiasing. For example, the mixing of Drebin and Naze generates:

  • out/drebin_debiased-naze_debiased-replay-training
    • drebin_debiased-naze_debiased-replay-training.characteristics.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-training.features_specs.dataset_class_info.json
  • out/drebin_debiased-naze_debiased-replay-test-5p
    • drebin_debiased-naze_debiased-replay-test-5p.characteristics.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.dataset_class_info.json

Comparing the intersection of two datasets

To count the number of elements shared between these replays and the original datasets, the following command can be used:

python3 config.datasets.ini \
--source-datasets list of datasets used for mixing \
--datasets mixed datasets \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The source datasets are the ones that were used to produce the mixed datasets. The script helps to check that applications are well balanced in the produced mixed datasets.

For example, for analysing the mixed dataset “debiased Drebin” and “debiased NAZE”, the user should do:

python3 config.datasets.ini \
--source-datasets debias-drebin-to-30k-04-replay debias-goodware-2018-to-30k-0025 \
--datasets drebin_debiased-naze_debiased-replay-training drebin_debiased-naze_debiased-replay-test-5p \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The output shows that the test set does not contain any application from debias-drebin or debias-goodware before 2013. Indeed, the test set only starts at years strictly greater than 2013. The output also shows that the training set is balanced between goodware and malware for each year.

For example, the training set contains the following (malware/goodware balanced) for the last available year:

Total for 2013:
| | Replay (0.04) Debiased Drebin --> AndroZoo 30k (2020) | (0.0025) Debiased AndroZoo Goodware (2018) --> AndroZoo 30k (2020) |
| Drebin Debiased replay + NAZE Debiased (Replay)-training | 81 | 81 |

And the test set contains the following (5% malware) for the first available year:

Total for 2014:
| | Replay (0.04) Debiased Drebin --> AndroZoo 30k (2020) | (0.0025) Debiased AndroZoo Goodware (2018) --> AndroZoo 30k (2020) |
| Drebin Debiased replay + NAZE Debiased (Replay)-test-5p | 8 | 169 |

Notice that, because the hashes in the debiased datasets differ most of the time, the results shown may differ from those obtained with a new “debiased Drebin” and “debiased NAZE”. However, using the same datasets as inputs (the ones generated in the previous section), but with a different “id” and “name”, the resulting mixed dataset will have the same number of hashes.

Performing all debiasing experiments

To reproduce all the experiments of Table II, the user can launch the following script:


To reproduce all the experiments of Table III, after reproducing those of Table II, the user can run:


Downloading APK datasets

We cannot provide the samples directly in this zip archive, as our institution does not allow us to do so. Nevertheless, we provide scripts to recover them from the sha256.txt files.

Goodware datasets

Goodware datasets can be downloaded from AndroZoo, using the script “”:

usage: [-h] api_key_file hash_list_file output_dir

For example for Drebin:

python3 api_key_androzoo datasets/drebin.sha256.txt tmp
Num hashes: 5304
sha256 to download: a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91

Malware datasets

Malware datasets can be partially found in AndroZoo. Drebin and AMD are available, but all VirusShare datasets should be downloaded from the VirusShare website.

Mixed datasets

The mixed datasets can be fully downloaded from AndroZoo:

Usage: python3 api_key_androzoo outdir

python3 api_key_androzoo ./

The script creates the following tree and populates it:

├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── DN-NoC2
│   ├── drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
│   ├── drebin_debiased-naze_debiased-test-5p_no-bal-time-win
│   └── drebin_debiased-naze_debiased-training_no-bal-time-win
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training
└── VSN-NoC2
    ├── vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
    ├── vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
    └── vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win

The script can be interrupted, and relaunching it continues the download.


This dataset contains multimodal sensor data collected from side-channels while printing several types of objects on an Ultimaker 3 3D printer. Our related research paper titled "Sabotage Attack Detection for Additive Manufacturing Systems" can be found here: In our work, we demonstrate that this sensor data can be used with machine learning algorithms to detect sabotage attacks on the 3D printer.


The DroneDetect dataset consists of recordings from 7 models of popular Unmanned Aerial Systems (UAS), including the new DJI Mavic 2 Air S, DJI Mavic Pro, DJI Mavic Pro 2, DJI Inspire 2, DJI Mavic Mini, DJI Phantom 4 and the Parrot Disco. Recordings were collected using a Nuand BladeRF SDR and the open-source software GNURadio. Four subsets of data are included in this dataset: UAS signals in the presence of Bluetooth interference, in the presence of Wi-Fi signals, in the presence of both, and with no interference.


Sample rate: 60 MS/s (complex samples per second)

Bandwidth: 28 MHz

Centre Freq: 2.4375 GHz

Each recording consists of 1.2 × 10^8 complex samples, equating to 2 seconds of recording time. Data is saved in ‘.dat’ files, with the complex data stored as interleaved floats. A script ‘’ is included to load the data into Python and further split it into smaller samples 20 ms in length.
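Assuming the interleaved floats are 32-bit (the description above only says "interleaved floats"), the recordings can be loaded and chunked with NumPy roughly as follows; the function names are ours, not the included script's:

```python
import numpy as np

SAMPLE_RATE = 60_000_000  # complex samples per second, from the README


def load_iq(path: str) -> np.ndarray:
    """Load a .dat recording of interleaved I/Q float32 values into a
    complex array. The float32 width is an assumption."""
    raw = np.fromfile(path, dtype=np.float32)
    return raw[0::2] + 1j * raw[1::2]  # de-interleave I and Q


def split_chunks(iq: np.ndarray, chunk_ms: int = 20,
                 sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Reshape a recording into chunks of chunk_ms milliseconds,
    dropping any trailing remainder."""
    chunk_len = int(sample_rate * chunk_ms / 1000)
    n = len(iq) // chunk_len
    return iq[: n * chunk_len].reshape(n, chunk_len)
```

At 60 MS/s, a 20 ms chunk is 1,200,000 complex samples, so each 2-second recording yields 100 chunks.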

Files are categorised by interference, then by flight mode –

Switched on = ON

Hovering = HO

Flying = FY

Each file name uses an interference identifier, 00 for a clean signal, 01 for Bluetooth only, 10 for Wi-Fi only and 11 for Bluetooth and Wi-Fi interference concurrently. An example file name for Mavic Mini switched on in the presence of Bluetooth and Wi-Fi interference would be:

MIN + 11 + 00 + 00 = MIN_1100_00.dat
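A small parser for this naming scheme, treating everything beyond the model code and the two-character interference identifier as undecoded (the meaning of the trailing fields is not fully specified above, so this sketch does not interpret them):

```python
# Interference identifiers, as listed in the README.
INTERFERENCE = {
    "00": "clean",
    "01": "Bluetooth",
    "10": "Wi-Fi",
    "11": "Bluetooth+Wi-Fi",
}


def parse_name(filename: str) -> dict:
    """Decode a DroneDetect file name such as 'MIN_1100_00.dat'.

    Returns the model code, the interference label, and the remaining
    undecoded fields.
    """
    stem = filename.rsplit(".", 1)[0]
    model, middle, suffix = stem.split("_")
    return {
        "model": model,
        "interference": INTERFERENCE[middle[:2]],
        "rest": (middle[2:], suffix),
    }
```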





The .zip archive contains a folder ‘tasks’ and a .csv file, “analysis_results.csv”, which is a comma-delimited table with 4077 entries. Each subfolder of the ‘tasks’ folder represents an analysis task for a unique sample. The association between tasks and samples is given in the analysis_results.csv table, which contains the analysis results per sample: each row represents a botnet sample and holds information such as the analysis task id, the file hash, the URL of the server the sample was captured from, and the analysis results for that sample. For each task id, the corresponding folder contains: 1) the results of the analysis (analysis_result.json); 2) the captured traffic (capture.pcap); 3) the recorded system calls (syscalls.json); and 4) the botnet sample file (an ELF binary) with its original filename. Depending on the IoT botnet sample analysed, the network traffic may include port scanning, exploitation, C2 communications and DDoS traffic.
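A minimal sketch for walking the extracted archive in Python; the `task_id` column name is an assumption about the analysis_results.csv header:

```python
import csv
import os


def iter_tasks(root: str):
    """Yield (row, task_dir) pairs for each entry of analysis_results.csv.

    root: path to the extracted archive, containing analysis_results.csv
    and the tasks/ folder. The 'task_id' column name is assumed.
    """
    with open(os.path.join(root, "analysis_results.csv"), newline="") as f:
        for row in csv.DictReader(f):
            yield row, os.path.join(root, "tasks", str(row["task_id"]))
```

Each yielded task_dir is then expected to hold analysis_result.json, capture.pcap, syscalls.json and the sample binary.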




Dataset used in the article "The Reverse Problem of Keystroke Dynamics: Guessing Typed Text with Keystroke Timings". It contains CSV files with dataset results summaries, the evaluated sentences, detailed results, and scores. The results data contains training and evaluation ARFF files for each user, with features of synthetic and legitimate samples as described in the article. The source data comes from three free-text keystroke dynamics datasets used in previous studies, by the authors (LSIA) and two other unrelated groups (KM, and PROSODY, subdivided into GAY, GUN, and REVIEW).


Dataset including over 40,000 generated images of malicious binaries for malware classification in machine learning, as outlined in "NARAD - A Novel Auto-learn Real-time Fuzzy Machine Learning Anomaly Detection and Classification System".