Dada: Debiased Android DAtasets

Citation Author(s):
Tomas
Concepcion Miranda
CentraleSupélec - Inria
Pierre-Francois
Gimenez
CentraleSupélec - Inria
Jean-Francois
Lalande
CentraleSupélec - Inria
Valérie
Viet Triem Tong
CentraleSupélec - Inria
Pierre
Wilke
CentraleSupélec - Inria
Submitted by:
Jean-Francois L...
Last updated:
Wed, 06/23/2021 - 13:21
DOI:
10.21227/0frv-zb46

Abstract 

The Dada dataset is associated with the paper “Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?”. Its goal is to provide an updated dataset of goodware/malware applications that other researchers can use in their experiments, for example to evaluate detection or classification algorithms. The dataset contains the application hashes and some characteristics. The dataset DOES NOT contain the malware samples themselves, but they can be downloaded by hash from well-known repositories such as AndroZoo and VirusShare.

We provide well-known older datasets (Drebin, AMD) and several extracts of the AndroZoo and VirusShare application repositories. We also provide the pre-computed output of the debiasing process, in which labeled biased datasets are modified to resemble an extract of AndroZoo. Researchers can directly use these outputs to perform their own experiments. We also provide the scripts that implement the proposed debiasing algorithm to make our experiments fully reproducible.

Instructions: 

Quick-start for using the output datasets for your own experiment

If you just want to use the mixed datasets (goodware/malware) for your experiments, run:

python3 download_mixed_datasets.py api_key_androzoo ./

where api_key_androzoo is the file containing your API key, provided by the team administrating AndroZoo. This script downloads applications from AndroZoo according to the result of debiasing Drebin/VirusShare mixed with Naze. This result is cached for you.

Two datasets are provided:

  • DN: a debiased version of Drebin mixed with goodware from AndroZoo (called Naze)
  • VSN: a debiased version of VirusShare mixed with goodware from AndroZoo (called Naze)

.
├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training

More information about how these datasets have been constructed is given in the paper and this README.

Datasets

We provide each dataset as a file containing a list of hashes, together with some additional information, such as whether each APK is malware or not for mixed datasets. As the primary intent of this work is to debias datasets, we do not need (nor provide) the APKs themselves. Nevertheless, the content of all these datasets can be recovered with helper scripts, as explained at the end of this document.

All dataset information is located in the files of the datasets/ folder.

File structure

  • file.sha256.txt: hashes of the applications of the dataset
  • file.characteristics.csv: the characteristics for each SHA-256 hash
  • file.goodmal.csv: the information about the class (goodware=0 or malware=1) when the dataset is mixed. This file is optional when the dataset is a full goodware or malware file.

The header of the characteristics.csv file is:

sha256,date,year,APK size,Personal information,Leak information,Phone integrity,Denial of service,Intrusion
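As a minimal sketch of how such a file could be read, the following Python snippet parses a characteristics file with the standard csv module. Only the header comes from the dataset. The sample row (hash, date format, size and flag values) is invented for illustration, and treating the last five columns as 0/1 flags is an assumption.

```python
import csv
import io

# Header as documented above; the data row is a made-up example.
SAMPLE = (
    "sha256,date,year,APK size,Personal information,Leak information,"
    "Phone integrity,Denial of service,Intrusion\n"
    "aaaa0000,2012-05-01,2012,1048576,1,0,0,0,1\n"
)

FLAG_COLUMNS = [
    "Personal information",
    "Leak information",
    "Phone integrity",
    "Denial of service",
    "Intrusion",
]


def read_characteristics(handle):
    """Return {sha256: row} with year, size and flags converted to int."""
    rows = {}
    for row in csv.DictReader(handle):
        row["year"] = int(row["year"])
        row["APK size"] = int(row["APK size"])
        for col in FLAG_COLUMNS:
            row[col] = int(row[col])  # assumption: flags are stored as 0/1
        rows[row["sha256"]] = row
    return rows


characteristics = read_characteristics(io.StringIO(SAMPLE))
```

To read a real file, replace io.StringIO(SAMPLE) with open("datasets/drebin.characteristics.csv", newline="") and adjust the conversions if the actual value formats differ.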

Malware datasets:

The datasets of the paper correspond to the files:

  • drebin: drebin
  • AMD: amd
  • VS 2015: virusshare-2015
  • VS 2016: virusshare-2016
  • VS 2017: virusshare-2017
  • VS 2018: virusshare-2018

Androzoo extracts:

The datasets of the paper correspond to the files:

  • AZ19_100k: androzoo-100k
  • AZ19_100k 2015: androzoo-100k-2015
  • AZ19_100k 2016: androzoo-100k-2016
  • AZ19_100k 2017: androzoo-100k-2017
  • AZ19_100k 2018: androzoo-100k-2018
  • AZ20 10k: androzoo-10k-2020
  • AZ20 20k: androzoo-20k-2020
  • AZ20 30k: androzoo-30k-2020

Note that a few applications have been removed from these extracts as analysis tools like apktool fail to analyze these apps.

Goodware datasets:

The datasets of the paper correspond to the files:

  • NAZE-18-G: goodware-2018
  • NAZE_Debiased-18-G: debias-goodware-2018-to-30k-0025

Debiased malware datasets:

The datasets of the paper correspond to the files:

  • Drebin_Debiased: debias-drebin-to-30k-0025
  • VS_Debiased-15-18: debias-vs15-18-to-30k-02

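For delta in {0.0025, 0.005, 0.01, 0.02, 0.04}, where the delta suffix of the file name is written without the leading “0.” (for example 0025 for 0.0025):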

  • VS_Debiased-15: debias-vs2015-to-az100k-2015-delta
  • VS_Debiased-16: debias-vs2016-to-az100k-2016-delta
  • VS_Debiased-17: debias-vs2017-to-az100k-2017-delta
  • VS_Debiased-18: debias-vs2018-to-az100k-2018-delta
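The naming pattern above can be expanded programmatically. The sketch below assumes that the suffix is simply the fractional digits of delta, which matches the file names shown in this README (for example debias-drebin-to-30k-0025) but has not been checked against every file of the archive.

```python
# Deltas listed above, kept as strings so that the digits are preserved.
DELTAS = ["0.0025", "0.005", "0.01", "0.02", "0.04"]
YEARS = [2015, 2016, 2017, 2018]


def debiased_vs_names(years=YEARS, deltas=DELTAS):
    """Build the per-year debiased VirusShare dataset names."""
    names = []
    for year in years:
        for delta in deltas:
            suffix = delta.split(".", 1)[1]  # assumption: "0.0025" -> "0025"
            names.append(f"debias-vs{year}-to-az100k-{year}-{suffix}")
    return names


names = debiased_vs_names()
```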

Mixed dataset:

These datasets contain the additional file file.goodmal.csv.

Dmix

The datasets of the paper correspond to the files:

  • Dmix: mix-drebin-two-third-goodware

Drebin_Debiased + NAZE_Debiased-18-G

These datasets have been built to be directly usable by machine learning algorithms. Instructions for downloading them are given at the end of this document. Downloading all the APKs of these datasets is not required to run the debiasing algorithms.

The datasets of the paper correspond to the files:

Training sets:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win
  • DN50: drebin_debiased-naze_debiased-training

Test sets:

  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win
  • DN5: drebin_debiased-naze_debiased-test-5p
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
  • DN10: drebin_debiased-naze_debiased-test-10.0p

Goodware/Malware information:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win.goodmal.csv
  • DN50: drebin_debiased-naze_debiased-training.goodmal.csv
  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win.goodmal.csv
  • DN5: drebin_debiased-naze_debiased-test-5p.goodmal.csv
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win.goodmal.csv
  • DN10: drebin_debiased-naze_debiased-test-10.0p.goodmal.csv

VS_Debiased-15-18 + NAZE_Debiased-18-G

Training sets:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win
  • VSN50: vs15-18_debiased-naze_debiased-2017-training

Test sets:

  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p

Goodware/Malware information:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win.goodmal.csv
  • VSN50: vs15-18_debiased-naze_debiased-2017-training.goodmal.csv
  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win.goodmal.csv
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p.goodmal.csv
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win.goodmal.csv
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p.goodmal.csv

Usage of debiasing and evaluation scripts

To replay the dataset debiasing process (or to try a new arrangement of datasets), we provide the scripts described below. Debiasing first requires generating the possible classes, which are defined by all the observed combinations of characteristics.

Requirements

First, install the packages using the requirements.txt.

pip3 install -r requirements.txt

Generation of classes

The debiasing algorithm requires, for each dataset, a subfolder containing files that describe the classes (combinations of characteristics) and which APKs belong to each class. These subfolders are generated in the input_datasets folder.

To generate these subfolders, use the following command:

python3 csv_to_combinations_hashes.py config.datasets.ini datasets

For example, for drebin, the script generates:

drebin.characteristics.csv
drebin.features_specs.classcount.csv
drebin.features_specs.combination_hashes.json

  • the characteristics.csv file is a copy of the original file located in the datasets/ folder
  • the classcount.csv file indicates the number of APKs that match each combination of characteristics (a class). For example, for Drebin, the class “1,1,1,0,0,0” contains 543 applications.
  • the combination_hashes.json file contains a dictionary that associates each combination of characteristics (a class) with the list of SHA-256 hashes of its APKs

This information can be used later more efficiently when debiasing datasets.
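To illustrate the relationship between these files, here is a minimal sketch of how hashes could be grouped by class and then counted. It is not the code of csv_to_combinations_hashes.py; the toy rows and the column names "a" and "b" are hypothetical.

```python
from collections import defaultdict


def group_by_class(rows, class_columns):
    """Map each combination of characteristics (a class) to its hashes."""
    combos = defaultdict(list)
    for row in rows:
        key = ",".join(str(row[col]) for col in class_columns)
        combos[key].append(row["sha256"])
    return dict(combos)


# Toy input: three applications described by two hypothetical characteristics.
rows = [
    {"sha256": "h1", "a": 1, "b": 0},
    {"sha256": "h2", "a": 1, "b": 0},
    {"sha256": "h3", "a": 0, "b": 1},
]

combination_hashes = group_by_class(rows, ["a", "b"])  # like the .json file
class_count = {key: len(hashes) for key, hashes in combination_hashes.items()}
```

Here combination_hashes plays the role of the JSON dictionary and class_count the role of the classcount.csv file.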

Single dataset debiasing algorithm

To debias one base dataset (or a union of base datasets) given a target dataset and a list of source datasets, refer to each dataset by the basename of its files in the datasets folder. The command line looks like the following:

python3 debias_script.py config.datasets.ini \
new_dataset_name \
"Name of the new dataset" \
--base-datasets base_dataset_list \
--target-dataset target_dataset_name \
--source-datasets list \
--delta value

where delta is a value in [0,1] that controls the distance between the output and the target dataset.

Four outputs are expected in out/new_dataset_name and are similar to the outputs of the generation of classes:

  • the file new_dataset_name.characteristics.csv: the characteristics.
  • the file new_dataset_name.features_specs.classcount.csv: the number of APK for each combination of characteristics (class).
  • the file new_dataset_name.features_specs.combination_hashes.json: contains a dictionary that associates a class with all SHA256 APKs.
  • the file new_dataset_name.features_specs.dataset_class_info.json: some general information about the experiment:
    • size: size of the dataset
    • modified: true if this dataset has been generated
    • base dataset and original size
    • target dataset
    • source dataset list
    • delta
    • number of combinations (classes)
    • combination not found: the classes that are empty, i.e. no APK representing this class could be found
    • debiasable: false if the debiasing algorithm fails
    • added: the number of APKs added from the source datasets to this new dataset
    • removed: the number of APKs removed from the base dataset
    • d_min final: the d_min value at the end of the algorithm described in the paper
    • add ratio: ratio of the number of added APKs to the size of the generated dataset, between 0 and 1
    • run time: the duration of the debiasing algorithm

For example, for the Drebin dataset as input, with androzoo-30k-2020 as target dataset, with amd and virusshare-201{5,6,7,8} as source datasets, and for a delta of 0.04, the user should launch:

python3 debias_script.py config.datasets.ini \
debias-drebin-to-30k-04-replay \
"Replay (0.04) Debiased Drebin --> AndroZoo 30k (2020)" \
--base-datasets drebin \
--target-dataset androzoo-30k-2020 \
--source-datasets amd virusshare-201{5,6,7,8} \
--delta "0.04"

We call this experiment “Replay” because the user replays the debiasing algorithm and should obtain results similar to the already provided dataset datasets/debias-drebin-to-30k-04.

Replaying this experiment generates in the folder out/debias-drebin-to-30k-04-replay/ the files:

debias-drebin-to-30k-04-replay.characteristics.csv
debias-drebin-to-30k-04-replay.features_specs.classcount.csv
debias-drebin-to-30k-04-replay.features_specs.combination_hashes.json
debias-drebin-to-30k-04-replay.features_specs.dataset_class_info.json
debias-drebin-to-30k-04-replay.sha256.txt

In particular, in the class_info file, we note that:

  • 886 apps have been added
  • 2421 apps have been removed
  • the add ratio is 23.5%
  • the final dataset size is 3769
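These figures are mutually consistent, as the quick check below suggests. It assumes that the base dataset size is the 5304 Drebin hashes reported in the download example later in this README, and that the add ratio is the number of added APKs divided by the final dataset size, as described in the list of outputs.

```python
original_size = 5304  # Drebin hashes ("Num hashes: 5304" in this README)
added = 886           # APKs taken from the source datasets
removed = 2421        # APKs dropped from the base dataset

final_size = original_size - removed + added
add_ratio = added / final_size

print(final_size)                 # 3769
print(f"{add_ratio:.1%}")         # 23.5%
```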

Even though the algorithm will generate a different dataset each time, the number of elements per class is the same in every re-run with the same base, target and source datasets, and delta. To verify this, the user can check the difference of the classcount.csv files between the original and the replay:

diff <(sort input_datasets/debias-drebin-to-30k-04/debias-drebin-to-30k-04.features_specs.classcount.csv) \
<(sort out/debias-drebin-to-30k-04-replay/debias-drebin-to-30k-04-replay.features_specs.classcount.csv)
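Where bash process substitution is not available, the same check can be approximated in Python. The sketch below compares the lines of two files as unordered multisets, which is what diffing the sorted files amounts to. The sample lines are invented and do not reflect the real layout of classcount.csv.

```python
from collections import Counter


def same_class_counts(lines_a, lines_b):
    """True if both inputs hold the same non-empty lines, in any order."""

    def as_multiset(lines):
        return Counter(line.strip() for line in lines if line.strip())

    return as_multiset(lines_a) == as_multiset(lines_b)


# Hypothetical content: same lines in a different order, then a changed count.
original = ["class_a;543\n", "class_b;12\n"]
replay = ["class_b;12\n", "class_a;543\n"]
altered = ["class_b;13\n", "class_a;543\n"]

identical = same_class_counts(original, replay)
different = same_class_counts(original, altered)
```

With real files, pass two open file objects instead of the lists, for example open(path_original) and open(path_replay).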

If the debiasing algorithm fails, several solutions can be tested:

  • increase the delta value to let the output be farther from the target
  • provide more samples in the source datasets (probably some classes do not have enough applications)

Comparing datasets with a population

For newly generated datasets as well as for the input datasets, the user may want to evaluate the distance to an extract of the population. We provide a script that computes the Chi2 statistic and the p-value, using the following command:

python3 count_analysis.py config.datasets.ini \
--population population_dataset \
--datasets list of datasets to evaluate \
--append-population-name \
--filename "filename"

The parameters are:

  • population: the name of the dataset to use as an extract of the considered population
  • datasets: a list of dataset names, which can be located in the datasets/ or out/ folders
  • append-population-name: adds the name of the population to the output file name
  • filename: the name of the output file

The outputs are:

  • out/count_analysis_output/filename_population_dataset.xlsx: a table containing the comparison of the considered datasets (Chi2, p-value, added/removed app count, etc.)
  • out/count_analysis_output/filename_population_dataset.tex: a LaTeX table containing the max delta and the size of the datasets.

These outputs, in particular the LaTeX output, can be customized to your needs.
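For readers unfamiliar with the statistic, the sketch below computes a Pearson Chi2 value between the class counts of a dataset and the class proportions of a population extract. It is an illustration only and is not taken from count_analysis.py, whose exact computation may differ; the class names and counts are invented.

```python
def chi_square(observed, population):
    """Pearson Chi2 of `observed` class counts against the class
    proportions of `population`. Both arguments map class -> count."""
    n_obs = sum(observed.values())
    n_pop = sum(population.values())
    stat = 0.0
    for cls, pop_count in population.items():
        expected = n_obs * pop_count / n_pop
        if expected == 0:
            continue  # class absent from the population: term undefined
        stat += (observed.get(cls, 0) - expected) ** 2 / expected
    return stat


# Invented counts for three classes.
population = {"0,0": 50, "0,1": 30, "1,0": 20}
unbiased = {"0,0": 5, "0,1": 3, "1,0": 2}   # same proportions
biased = {"0,0": 2, "0,1": 3, "1,0": 5}     # skewed proportions

stat_unbiased = chi_square(unbiased, population)
stat_biased = chi_square(biased, population)
```

A dataset with the same class proportions as the population gives a value of 0, and the value grows as the proportions diverge. If SciPy is installed, scipy.stats.chisquare can return both the statistic and the p-value from observed and expected counts.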

For example, for comparing the following three datasets with the extract of AndroZoo of size 30k extracted in 2020 (androzoo-30k-2020):

  • drebin: the original Drebin dataset
  • debias-drebin-to-30k-04: the debiased dataset already computed and dropped in the datasets/ folder
  • debias-drebin-to-30k-04-replay: the debiased dataset just generated by following this README

The user should use the following command:

python3 count_analysis.py config.datasets.ini \
--population androzoo-30k-2020 \
--datasets drebin debias-drebin-to-30k-04{,-replay} \
--append-population-name --filename "drebin_debias_replay"

As shown in the output of the script, debias-drebin-to-30k-04 and the replay (debias-drebin-to-30k-04-replay) have the same Chi2 value, which is expected. The file out/count_analysis_output/drebin_debias_replay_androzoo-30k-2020.xlsx contains a table with information about Drebin, the provided debiased dataset, and the newly generated one.

[Figure: Count analysis result]

Mix dataset debiasing algorithm

For producing mixed datasets, we provide a script that takes two datasets as input: one should contain the malware, the other the goodware.

The command is the following:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
"id_name_of_the_generated_dataset" "Full name of the generated dataset" \
id_debiased_malware_dataset \
id_debiased_goodware_dataset \
year-time-barrier_training-test \
--date-fix sha256.dex_date.vt_date.txt

with the parameters:

  • id_debiased_malware_dataset: the id of the malware dataset that will be loaded from folder datasets/ and out/.
  • id_debiased_goodware_dataset: the id of the goodware dataset that will be loaded from folder datasets/ and out/.
  • year-time-barrier_training-test: an integer representing the year used to split the datasets into the training part and the test part.
  • date-fix sha256.dex_date.vt_date.txt: a helper file that the user should provide to help identify the date of broken APKs. Indeed, some APKs have a date of 0 when the date is extracted from the APK archive (zip date construction). In this case, the script can open the helper file to search for an alternative date.
  • (optional) percent: specify the percent of malware applications for the output test dataset (the default is 5%)

For example, to mix the “debiased Drebin” dataset created in the previous section (debias-drebin-to-30k-04-replay) with the “debiased NAZE” dataset, using 2013 as the barrier delimiting the training set and the test set, run:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

To leave the C2 condition out, add the “no-balance-time-window” option:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt \
--no-balance-time-window

For specifying 10% of malware, add the “percent” option followed by 10:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--percent 10 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

This script outputs two folders, one for the training set and one for the test set. The content of each folder is similar to the output of debiasing a single dataset. For example, the mixing of Drebin and Naze generates:

  • out/drebin_debiased-naze_debiased-replay-training
    • drebin_debiased-naze_debiased-replay-training.characteristics.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-training.features_specs.dataset_class_info.json
  • out/drebin_debiased-naze_debiased-replay-test-5p
    • drebin_debiased-naze_debiased-replay-test-5p.characteristics.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.dataset_class_info.json

Comparing the intersection of two datasets

To count the number of elements in these replays and in the original datasets, use count_apps_per_date_by_single_dataset.py:

python3 count_apps_per_date_by_single_dataset.py config.datasets.ini \
--source-datasets list of datasets used for mixing \
--datasets mixed datasets \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The source datasets are the ones that have been used for producing the mixed datasets. The script helps to check that applications are well balanced in the produced mixed datasets.

For example, to analyse the mixed dataset built from “debiased Drebin” and “debiased NAZE”, run:

python3 count_apps_per_date_by_single_dataset.py config.datasets.ini \
--source-datasets debias-drebin-to-30k-04-replay debias-goodware-2018-to-30k-0025 \
--datasets drebin_debiased-naze_debiased-replay-training drebin_debiased-naze_debiased-replay-test-5p \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The output shows that the test set does not contain any application from debias-drebin or debias-goodware dated 2013 or earlier. Indeed, the test set should only contain applications from years greater than 2013. The output also shows that the training set is balanced between goodware and malware for each year.
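The behaviour described above can be sketched as follows: applications dated after the barrier year go to the test set, and the remaining ones are balanced per year between goodware and malware. This is a simplified illustration under stated assumptions, not the code of gen_dataset_by_parts_subset_year_c.py. The tuple layout, the random down-sampling and the seed parameter are hypothetical, and the malware percentage of the test set is not modelled.

```python
import random
from collections import defaultdict


def split_and_balance(apps, barrier_year, seed=0):
    """apps: list of (sha256, year, is_malware) tuples.
    Returns (training, test); training is balanced for each year."""
    rng = random.Random(seed)
    test = [app for app in apps if app[1] > barrier_year]

    # Group the remaining applications by year and by label (0 or 1).
    by_year = defaultdict(lambda: {0: [], 1: []})
    for app in apps:
        if app[1] <= barrier_year:
            by_year[app[1]][int(app[2])].append(app)

    training = []
    for year in sorted(by_year):
        goodware, malware = by_year[year][0], by_year[year][1]
        keep = min(len(goodware), len(malware))  # same count for both labels
        training += rng.sample(goodware, keep) + rng.sample(malware, keep)
    return training, test


# Invented applications: four dated 2013, three dated 2014.
apps = [
    ("g1", 2013, 0), ("g2", 2013, 0), ("g3", 2013, 0), ("m1", 2013, 1),
    ("g4", 2014, 0), ("g5", 2014, 0), ("m2", 2014, 1),
]
training, test = split_and_balance(apps, barrier_year=2013)
```

With this toy input, the training set keeps one goodware and one malware application for 2013, and the test set holds the three applications dated 2014.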

For example, the training set contains the following (malware/goodware balanced) for the last available year:

Total for 2013:
+----------------------------------------------------------+-----------------------------+-----------------------------+
| | Replay (0.04) Debiased Dr | (0.0025) Debiased AndroZo |
| | ebin --> AndroZoo 30k (20 | o Goodware (2018) --> And |
| | 20) | roZoo 30k (2020) |
|----------------------------------------------------------+-----------------------------+-----------------------------|
| Drebin Debiased replay + NAZE Debiased (Replay)-training | 81 | 81 |
+----------------------------------------------------------+-----------------------------+-----------------------------+

And the test set contains the following (5% malware) for the first available year:

Total for 2014:
+---------------------------------------------------------+-----------------------------+-----------------------------+
| | Replay (0.04) Debiased Dr | (0.0025) Debiased AndroZo |
| | ebin --> AndroZoo 30k (20 | o Goodware (2018) --> And |
| | 20) | roZoo 30k (2020) |
|---------------------------------------------------------+-----------------------------+-----------------------------|
| Drebin Debiased replay + NAZE Debiased (Replay)-test-5p | 8 | 169 |
+---------------------------------------------------------+-----------------------------+-----------------------------+

Note that, because the hashes in the debiased datasets are different most of the time, the results shown may differ from the ones obtained with a new “debiased Drebin” and “debiased NAZE”. However, when using the same datasets as inputs (the ones generated in the previous section) with a different “id” and “name”, the resulting mixed dataset will have the same number of hashes.

Performing all debiasing experiments

To reproduce all the experiments reported in Table II, the user can launch the following script:

python3 redo_table_II.py

To reproduce all the experiments reported in Table III, after reproducing the ones of Table II, the user can run:

bash redo_table_III.sh

Downloading APK datasets

We cannot provide the samples directly in this zip archive, as our institution does not allow us to do so. Nevertheless, we provide scripts to recover them from the sha256.txt files.

Goodware datasets

Goodware datasets can be downloaded from AndroZoo, using the script “get_apk_from_androzoo.py”:

python3 get_apk_from_androzoo.py
usage: get_apk_from_androzoo.py [-h] api_key_file hash_list_file output_dir

For example, for Drebin:

python3 get_apk_from_androzoo.py api_key_androzoo datasets/drebin.sha256.txt tmp
Num hashes: 5304
sha256 to download: a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91
Done
...

Malware datasets

Malware datasets can be partially found in AndroZoo. Drebin and AMD are available, but all VirusShare datasets should be downloaded from the VirusShare website.

Mixed datasets

The mixed datasets can be fully downloaded from AndroZoo:

Usage: python3 download_mixed_datasets.py api_key_androzoo outdir

python3 download_mixed_datasets.py api_key_androzoo ./

The script creates the following tree and populates it:

.
├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── DN-NoC2
│   ├── drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
│   ├── drebin_debiased-naze_debiased-test-5p_no-bal-time-win
│   └── drebin_debiased-naze_debiased-training_no-bal-time-win
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training
└── VSN-NoC2
    ├── vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
    ├── vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
    └── vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win

The script can be interrupted and you can relaunch the download.
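One way such a resumable download could work is to skip the hashes whose file is already present in the output folder. The sketch below is not the code of the provided scripts. The endpoint and parameter names follow the public AndroZoo API documentation as we understand it and should be verified there, and the <sha256>.apk file naming is an assumption.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlencode

# Assumption: AndroZoo download endpoint, to be checked against its docs.
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def download_url(api_key, sha256):
    """Build the download URL for one application hash."""
    query = urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD}?{query}"


def pending_hashes(hashes, output_dir):
    """Hashes whose APK is not yet in output_dir (assumed <sha256>.apk)."""
    out = Path(output_dir)
    return [h for h in hashes if not (out / f"{h}.apk").exists()]


# Demonstration with a temporary folder holding one "downloaded" file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aaa.apk").touch()
    todo = pending_hashes(["aaa", "bbb"], tmp)

example_url = download_url("KEY", "abc")
```

Only the hashes returned by pending_hashes would then need to be fetched when the download is relaunched.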