Dada: Debiased Android DAtasets

Citation Author(s):
Tomas Concepcion Miranda (CentraleSupélec - Inria)
Pierre-Francois Gimenez (CentraleSupélec - Inria)
Jean-Francois Lalande (CentraleSupélec - Inria)
Valérie Viet Triem Tong (CentraleSupélec - Inria)
Pierre Wilke (CentraleSupélec - Inria)
Submitted by: Jean-Francois Lalande
Last updated: Tue, 03/07/2023 - 11:49
DOI: 10.21227/0frv-zb46
Data Format: *.csv; *.txt
Links: Project gitlab repository

Categories:

Security

Keywords:

Android malware

Abstract

The Dada dataset is associated with the paper “Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?”. The goal of this dataset is to provide a new, updated dataset of goodware/malware applications that other researchers can use to run experiments, for example on detection or classification algorithms. The dataset contains the application hashes and some characteristics. The dataset DOES NOT contain the malware samples themselves, but they can be downloaded by hash from well-known repositories such as AndroZoo and VirusShare.

We provide well-known old datasets (Drebin, AMD) and several extracts of the AndroZoo and VirusShare repository of applications. We also provide the pre-computed output of the debiasing process of labeled biased datasets modified to resemble an extract of AndroZoo. Researchers can directly use these outputs to perform their own experiments. We also provide the scripts that implement the proposed debiasing algorithm to make our experiments fully reproducible.

Instructions:

Quick-start for using the output datasets for your own experiment

If you just want to use the mixed datasets (goodware/malware) for your experiments, run:

python3 download_mixed_datasets.py api_key_androzoo api_key_virusshare ./

Here, api_key_androzoo is the file containing your API key provided by the team administrating AndroZoo, and api_key_virusshare is the file containing the API key provided by VirusShare. This script downloads applications from AndroZoo, according to the result of debiasing Drebin/VirusShare mixed with Naze. This result is cached for you.

Several datasets are provided in the datasets/ folder, in particular mixed datasets with goodware/malware:

DR-AG_deb: a debiased version of Drebin (DR) mixed with goodware from AndroZoo (AG)
VS-AG_deb: a debiased version of VirusShare (VS) mixed with goodware from AndroZoo (AG)
VS-AG_deb-04: a debiased version of VirusShare (VS) with delta = 0.04 mixed with goodware from AndroZoo (AG)

        .
        ├── DR-AG_deb
        │   ├── DR-AG_deb-test(.sha256,.characteristics,.merged_characteristics).csv
        │   └── DR-AG_deb-training(.sha256,.characteristics,.merged_characteristics).csv
        │   └── DR-AG_deb-training.goodmal.csv
        │   └── DR-AG_deb-test.goodmal.csv
        ├── VS-AG_deb
        │   ├── VS-AG_deb-test(.sha256,.characteristics,.merged_characteristics).csv
        │   └── VS-AG_deb-training(.sha256,.characteristics,.merged_characteristics).csv
        │   └── VS-AG_deb-training.goodmal.csv
        │   └── VS-AG_deb-test.goodmal.csv
        ├── VS-AG_deb-04
        │   ├── VS-AG_deb-04-test(.sha256,.characteristics,.merged_characteristics).csv
        │   └── VS-AG_deb-04-training(.sha256,.characteristics,.merged_characteristics).csv
        │   └── VS-AG_deb-04-training.goodmal.csv
        │   └── VS-AG_deb-04-test.goodmal.csv
        ...

More information about how these datasets have been constructed is given in the paper and this README.

Datasets

We provide each dataset as a list of hashes in a file and some additional information such as if an APK is a malware or not for mixed datasets. As the primary intent of this work is to debias datasets, we do not need (nor provide) the APKs themselves. Nevertheless, one can recover all these datasets’ content with helper scripts, as explained at the end of this document.

All dataset information is located in the files of the datasets/ folder.

File structure

file.sha256.txt: hashes of the applications of the dataset
file.characteristics.csv: the characteristics for each SHA-256 hash (DroidLysis only)
file.merged_characteristics.csv: the characteristics for each SHA-256 hash (DroidLysis + FalDroid)
file.goodmal.csv: the information about the class (goodware=0 or malware=1) when the dataset is mixed. This file is optional when the dataset is a full goodware or malware file.

The header of the characteristics.csv file is:

sha256,APK size,Year,Internet Permission,External storage,Uses Play Services,Generates UUIDs,Vibrate phone,NFC,Bluetooth,Uses HTTP,Uses JSON,Specify User-Agent,apk_size,dex_date,year,minSdkVersion,targetSdkVersion,android.permission.READ_PHONE_STATE,android.permission.READ_CONTACTS,android.permission.READ_SMS,android.permission.CAMERA,android.permission.RECORD_AUDIO,android.permission.READ_EXTERNAL_STORAGE,android.permission.READ_HISTORY_BOOKMARKS,android.permission.ACCESS_NETWORK_STATE,android.permission.ACCESS_WIFI_STATE,android.permission.GET_TASKS,android.permission.ACTIVITY_RECOGNITION,android.permission.INTERNET,android.permission.SEND_SMS,android.permission.CALL_PHONE,android.permission.READ_CALL_LOG,android.permission.BLUETOOTH_ADMIN,android.permission.BLUETOOTH,android.permission.BODY_SENSORS,android.permission.GET_ACCOUNTS,android.permission.WRITE_EXTERNAL_STORAGE,android.permission.NFC,android.permission.WRITE_CONTACTS,android.permission.WRITE_SMS,android.permission.MOUNT_FORMAT_FILESYSTEMS,android.permission.CHANGE_NETWORK_STATE,android.permission.CHANGE_WIFI_STATE,android.permission.REORDER_TASKS,android.permission.WAKE_LOCK,android.permission.REBOOT,android.permission.KILL_BACKGROUND_PROCESSES,android.permission.INSTALL_PACKAGES,android.permission.REQUEST_INSTALL_PACKAGES,android.permission.INJECT_EVENTS,android.permission.SYSTEM_ALERT_WINDOW,abort_broadcast,accessibility_service,account_pwd,airplane,android_id,andy,answer_call,apkprotect,base64,battery,bluestacks,board,bookmarks,bootloader,brand,busybox,calendar,call,call_log,camera,check_permission,contacts,cookie_manager,cpu_abi,crc32,c2dm,debugger,device_admin,dex_class_loader,dex_file,dhcp_server,dns,doze_mode,email,emulator,encryption,end_call,execute_native,fingerprint,genymotion,get_accounts,get_active_network_info,get_external_storage_stage,get_imei,get_imsi,get_installed_packages,get_installer_package_name,get_line_number,get_mac,get_network_operator,get_package_info,get_sim_country_iso,get_sim_operator,get_sim_serial_number,get_sim_slot_index,gps,gzip,hardware,hide_softkeyboard,http,intent_chooser,ip_address,ip_properties,javascript,jni,json,keyguard,kill_proc,link_speed,load_dex,load_library,logcat,manufacturer,microphone,model,nop,nox,obfuscation,open_non_asset,package_delete,package_sig,pangxie,password,phone_number,play_protect,post,product,receive_sms,record,reflection,ringer,rooting,rssi,scp,search_url,send_sms,sensor,set_component,shortcut,socket,ssh,ssid,stacktrace,su,substrate,system_app,tasks,uri,url_history,user_agent,uuid,version,vibrate,vnd_package,wakelock,wallpaper,webview,wifi,zip,am_start_elsewhere,android_wear,coinhive,cryptocurrency,cryptoloot,c2_anon,gps_elsewhere,javascript_html_load,jni_onload,has_phonenumbers,has_url,ip_address_elsewhere,kill_elsewhere,miner,mms,play_protect_elsewhere,play_services,pm_install_elsewhere,qemu,screen_on_off,sfr,su_exector,systemprop,ch***,exec,shell,mounts,geteuid,adb,pm_install,pm_list,am_broadcast,am_start,kill,ptrace,proc_version,possible_exploit,ragecage,exploid,zerg,levitator,mempodroid,towelroot,supersu,dalvikvm,dexclassloader,loadclass,url_in_exec,mtk_su

Two versions of the characteristics files (filenames contain either .characteristics or .merged_characteristics) are given for the mixed datasets. This is because we added some extra characteristics from the FalDroid tool (merged file). These extra files only exist for mixed datasets because we only computed these characteristics for machine learning experiments. This is explained later in this README (section Including extra features from FalDroid).
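As a minimal sketch of how a characteristics file can be read, the snippet below uses Python's standard csv module. The column names sha256 and Year come from the header above; the helper name count_by_year and the sample rows (including the values in the Year column, whose encoding is not described here) are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_year(csv_file):
    """Count rows per value of the 'Year' column of a characteristics CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Year"] for row in reader)


# Hypothetical two-column excerpt; real files carry the full header shown above.
sample = io.StringIO(
    "sha256,Year\n"
    "aaaa,1\n"
    "bbbb,1\n"
    "cccc,2\n"
)
counts = count_by_year(sample)
print(counts)
```

On a real file, the same function could be called on open("datasets/Drebin.characteristics.csv", newline=""), assuming the file sits at that path.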

Malware datasets:

The datasets of the paper correspond to the files:

Drebin: Drebin
AMD: AMD
VirusShare 2015: VirusShare_2015
VirusShare 2016: VirusShare_2016
VirusShare 2017: VirusShare_2017
VirusShare 2018: VirusShare_2018
ACT-M: ACT-M
AZL-M: AZL-M

Androzoo extracts:

The datasets of the paper correspond to the files:

AZ19 100k: AZ19_100k
AZ19 100k 2015: AZ19_100k_2015
AZ19 100k 2016: AZ19_100k_2016
AZ19 100k 2017: AZ19_100k_2017
AZ19 100k 2018: AZ19_100k_2018
AZ20 10k: AZ20_10k
AZ20 20k: AZ20_20k
AZ20 30k: AZ20_30k

Note that a few applications have been removed from these extracts as analysis tools like apktool fail to analyze these apps.

Goodware datasets:

The datasets of the paper correspond to the files:

NAZE-18-G: NAZE-18-G
NAZE-18-G_Debiased: NAZE-18-G_deb
AZ19 100k-G: AZ19_100k-G
ACT-G: ACT-G
AZL-G: AZL-G

DroidBench:

A micro-benchmark suite to assess the stability of taint-analysis tools for Android.

DroidBench: DroidBench

Debiased malware datasets:

The datasets of the paper correspond to the files:

Drebin debiased (Drebin_deb):

delta = 0.04: Drebin_deb-04
delta = 0.02: Drebin_deb-02
delta = 0.01: Drebin_deb-01

VirusShare debiased (VS15-18_deb):

delta = 0.04: VS15-18_deb-04
delta = 0.02: VS15-18_deb-02
delta = 0.01: VS15-18_deb-01
delta = 0.005: VS15-18_deb-005

VirusShare 2015 debiased (VS15_deb):

delta = 0.04: VS15_deb-04
delta = 0.02: VS15_deb-02

VirusShare 2016 debiased (VS16_deb):

delta = 0.04: VS16_deb-04

VirusShare 2017 debiased (VS17_deb):

delta = 0.04: VS17_deb-04

VirusShare 2018 debiased (VS18_deb):

delta = 0.04: VS18_deb-04
delta = 0.02: VS18_deb-02

Mixed datasets:

These datasets contain the additional file file.goodmal.csv that informs about the goodware/malware status of an APK.

D_mix

The datasets of the paper correspond to the files:

D_mix: D_mix

Drebin_Debiased + NAZE_Debiased-18-G

These datasets have been built to be directly usable for machine learning algorithms. For downloading them, you can go to the end of this document. Downloading all APKs of these datasets is not required to execute the debiasing algorithms.

The datasets of the paper correspond to the files:

DR-AG_deb:

training: DR-AG_deb-training
test: DR-AG_deb-test
goodware/malware information: DR-AG_deb-training.goodmal.csv, DR-AG_deb-test.goodmal.csv

DR-AG-C2_deb: (with C2 constraint)

training: DR-AG-C2_deb-training
test: DR-AG-C2_deb-test
goodware/malware information: DR-AG-C2_deb-training.goodmal.csv, DR-AG-C2_deb-test.goodmal.csv

VS_Debiased-15-18 + NAZE_Debiased-18-G

VS-AG_deb:

training: VS-AG_deb-training
test: VS-AG_deb-test
goodware/malware information: VS-AG_deb-training.goodmal.csv, VS-AG_deb-test.goodmal.csv

VS-AG-C2_deb: (with C2 constraint)

training: VS-AG-C2_deb-training
test: VS-AG-C2_deb-test
goodware/malware information: VS-AG-C2_deb-training.goodmal.csv, VS-AG-C2_deb-test.goodmal.csv

VS_Debiased-15-18-04 + NAZE_Debiased-18-G-01

VS-AG_deb,delta=0.04:

training: VS-AG_deb-04-training
test: VS-AG_deb-04-test
goodware/malware information: VS-AG_deb-04-training.goodmal.csv, VS-AG_deb-04-test.goodmal.csv

Drebin + AZ19 100k

training: DR-AG-training
goodware/malware information: DR-AG-training.goodmal.csv
No test set.

VS 15-18 + AZ19 100k

training: VS-AG-training
test: VS-AG-test
goodware/malware information: VS-AG-training.goodmal.csv, VS-AG-test.goodmal.csv

AndroCT (ACT)

2014:

Training sets:

training: ACT14-training
test: ACT14-test
goodware/malware information: ACT14-training.goodmal.csv, ACT14-test.goodmal.csv

2017:

training: ACT17-training
test: ACT17-test
goodware/malware information: ACT17-training.goodmal.csv, ACT17-test.goodmal.csv

AZ20 30k with labels

This dataset includes an additional file that states the goodware/malware information, extracted from AndroZoo.

2014:

training: AZL14-training
test: AZL14-test
goodware/malware information: AZL14-training.goodmal.csv, AZL14-test.goodmal.csv

2017:

training: AZL17-training
test: AZL17-test
goodware/malware information: AZL17-training.goodmal.csv, AZL17-test.goodmal.csv

Usage of debiasing and evaluation scripts

To replay the dataset debiasing process (or try a new arrangement of datasets), scripts are provided. Debiasing first requires generating the possible classes. Classes are defined by all the observed combinations of characteristics.

Requirements

First, install the packages using the requirements.txt.

pip3 install -r requirements.txt

Generation of classes

The debiasing algorithm requires subfolders with generated files that describe the classes (combinations of characteristics) and which APKs belong to each class. These subfolders are generated in input_datasets.

To generate these subfolders, use the following command:

python3 csv_to_combinations_hashes.py config.datasets.original.ini datasets

For example, for drebin, the script generates:

Drebin.characteristics.csv
Drebin.features_specs.classcount.csv
Drebin.features_specs.combination_hashes.json

The .characteristics.csv file is a copy of the original file located in the datasets/ folder.
The .classcount.csv file gives the number of APKs that match each combination of characteristics (a class). For example, for Drebin, the class “0,2,1,1,0,0,0,0,0,0,1,0” contains 139 applications.
The .combination_hashes.json file contains a dictionary that associates each combination of characteristics (a class) with the list of SHA-256 hashes of the APKs in that class.

This information can be used later more efficiently when debiasing datasets.
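The grouping described above can be sketched in a few lines of Python. This is not the code of csv_to_combinations_hashes.py: the helper name build_classes, the choice of columns that define a class, and the comma-joined key format are assumptions made for illustration.

```python
import csv
import io
import json
from collections import defaultdict


def build_classes(csv_file, feature_columns):
    """Group SHA-256 hashes by their combination of characteristic values."""
    classes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # The class key is the comma-joined values of the chosen columns.
        key = ",".join(row[col] for col in feature_columns)
        classes[key].append(row["sha256"])
    return dict(classes)


# Hypothetical excerpt with two characteristics only.
sample = io.StringIO(
    "sha256,Internet Permission,NFC\n"
    "aaaa,1,0\n"
    "bbbb,1,0\n"
    "cccc,0,1\n"
)
classes = build_classes(sample, ["Internet Permission", "NFC"])
class_counts = {key: len(hashes) for key, hashes in classes.items()}
print(json.dumps(classes, indent=2))  # analogous to combination_hashes.json
print(class_counts)                   # analogous to classcount.csv
```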

Notice that the configuration file config.datasets.original.ini is copied as config.datasets.ini. This last file will be used as the working configuration file for the rest of the README.

Single dataset debiasing algorithm

To debias one base dataset (or a union of base datasets) given a target dataset and a list of source datasets, the user should refer to each dataset by the basename of its files in the datasets/ folder. The command line looks like the following:

python3 debias_script.py config.datasets.ini \
    new_dataset_name \
    "Name of the new dataset" \
    --base-datasets base_dataset_list \
    --target-dataset target_dataset_name \
    --source-datasets list \
    --delta value

with delta the value in [0,1] for controlling the distance of the output with the target dataset.

Four outputs are expected in out/new_dataset_name and are similar to the outputs of the generation of classes:

the file new_dataset_name.characteristics.csv: the characteristics.
the file new_dataset_name.features_specs.classcount.csv: the number of APK for each combination of characteristics (class).
the file new_dataset_name.features_specs.combination_hashes.json: contains a dictionary that associates a class with all SHA256 APKs.
the file new_dataset_name.features_specs.dataset_class_info.json: some general information about the experiment:
- size: size of the dataset
- modified: true if this dataset has been generated
- base dataset list and original size
- target dataset
- source dataset list and size
- delta
- number of combinations (classes)
- debiasable: false if the debiasing algorithm fails
- added: the number of APK added from the source in this new dataset
- removed: the number of APK removed from the base dataset
- d_min final: the d_min value at the end in the paper algorithm
- percent modifs: ratio of additions and removals of APKs over the size of the base datasets, between 0 and 1
- add ratio: ratio of additions of new APK over the size of the generated dataset, between 0 and 1
- run time: the duration of performing the debiasing algorithm over this configuration

For example, to debias the Drebin dataset with AZ20_30k as the target dataset, AMD and VirusShare_201{5,6,7,8} as source datasets, and a delta of 0.01, the user should launch:

python3 debias_script.py config.datasets.ini \
    Drebin_deb-01-replay \
    "Drebin_deb-01 (Replay)" \
    --base-datasets Drebin \
    --target-dataset AZ20_30k \
    --source-datasets AMD VirusShare_201{5,6,7,8} \
    --delta "0.01"

We call this experiment “Replay” because the user replays the debiasing algorithm and should obtain results similar to the already provided dataset datasets/Drebin_deb-01.

Replaying this experiment generates the following files in the folder out/Drebin_deb-01-replay/:

Drebin_deb-01-replay.characteristics.csv
Drebin_deb-01-replay.features_specs.classcount.csv
Drebin_deb-01-replay.features_specs.combination_hashes.json
Drebin_deb-01-replay.features_specs.dataset_class_info.json
Drebin_deb-01-replay.sha256.txt

In particular, in the class_info file, we note that:

103 apps have been added
4596 apps have been removed
the add ratio is 12.7%
the final dataset size is 811
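These figures can be cross-checked with a little arithmetic. The sketch below assumes that the base size is the 5304 Drebin hashes reported by the “Num hashes” line of the download example near the end of this README, and that “percent modifs” means (added + removed) divided by the base size, which is one reading of the field description above.

```python
base_size = 5304  # Drebin hashes ("Num hashes: 5304" in the download example)
added = 103       # APKs taken from the source datasets
removed = 4596    # APKs dropped from the base dataset

final_size = base_size - removed + added
add_ratio = added / final_size
percent_modifs = (added + removed) / base_size  # assumed definition

print(final_size)                # 811
print(round(add_ratio, 3))       # 0.127, i.e. 12.7%
print(round(percent_modifs, 3))  # 0.886
```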

Even though the algorithm will generate a different dataset each time, the number of elements per class is the same in every re-run with the same base, target and source datasets, and delta. To verify this, the user can check the difference of the classcount.csv files between the original and the replay:

diff <(sort input_datasets/Drebin_deb-01/Drebin_deb-01.features_specs.classcount.csv) \
     <(sort out/Drebin_deb-01-replay/Drebin_deb-01-replay.features_specs.classcount.csv)
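Where process substitution is not available, the same check can be sketched in Python. The helper name same_class_counts is hypothetical, and the comparison is deliberately agnostic about the exact column layout of the classcount.csv files: it only checks that both files hold the same lines, ignoring order.

```python
import io
from collections import Counter


def same_class_counts(file_a, file_b):
    """Return True when both files hold the same non-empty lines, in any order."""
    lines_a = Counter(line.strip() for line in file_a if line.strip())
    lines_b = Counter(line.strip() for line in file_b if line.strip())
    return lines_a == lines_b


# Hypothetical file contents, given in a different order.
original = io.StringIO("class_a,139\nclass_b,12\n")
replay = io.StringIO("class_b,12\nclass_a,139\n")
print(same_class_counts(original, replay))  # True
```

With real files, the two arguments would be the opened classcount.csv files named in the diff command above.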

If the debiasing algorithm fails, several solutions can be tested:

increase the delta value to let the output be farther from the target
provide more samples in the source datasets (probably some classes do not have enough applications)

Comparing datasets with a population

When new datasets are generated, or with the input datasets, the user may want to evaluate the distance between these datasets and an extract of the population. In particular, we provide a script to evaluate the Chi2 and the p-value, using the following command:

python3 count_analysis.py config.datasets.ini \
    --population population_dataset \
    --datasets list of datasets to evaluate \
    --append-population-name \
    --filename "filename"

The parameters are:

--population: the name of the dataset to use as an extract of the considered population
--datasets: a list of dataset names that can be located in either the datasets/ or the out/ folder
(optional) --append-population-name: add the name of the population to the name of the output file
(optional) --filename: the name of the output file

The outputs are:

out/count_analysis_output/filename_population_dataset.xlsx: a table containing the comparison of the considered datasets (Chi2, p-value, added/removed app count, etc.)
out/count_analysis_output/filename_population_dataset.tex: a LaTeX table containing the max delta and the size of the datasets.

These outputs, in particular the latex output, can be customized to your needs.
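To give an idea of what a Chi2 comparison against a population involves, here is a minimal sketch of Pearson's chi-square statistic over class counts. It is not the code of count_analysis.py, whose exact computation is not described here; the function name and the toy counts are assumptions. A p-value could then be derived from the statistic, for instance with scipy.stats.chi2.sf and a suitable number of degrees of freedom.

```python
def chi2_statistic(dataset_counts, population_counts):
    """Pearson chi-square statistic of a dataset against population proportions."""
    unknown = set(dataset_counts) - set(population_counts)
    if unknown:
        # Expected count would be zero for these classes, so the test is undefined.
        raise ValueError(f"classes absent from the population: {sorted(unknown)}")
    n_dataset = sum(dataset_counts.values())
    n_population = sum(population_counts.values())
    statistic = 0.0
    for cls, pop_count in population_counts.items():
        expected = n_dataset * pop_count / n_population
        observed = dataset_counts.get(cls, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy class counts: the population is evenly split, the dataset is not.
population = {"class_a": 50, "class_b": 50}
biased = {"class_a": 30, "class_b": 10}
matching = {"class_a": 20, "class_b": 20}
print(chi2_statistic(biased, population))    # 10.0
print(chi2_statistic(matching, population))  # 0.0
```

A dataset whose class proportions match the population yields a statistic of zero, which is the situation debiasing tries to approach.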

For example, for comparing the following three datasets with the extract of AndroZoo of size 30k extracted in 2020 (AZ20_30k):

Drebin: the original Drebin dataset
Drebin_deb-01: the debiased dataset already computed and dropped in the datasets/ folder
Drebin_deb-01-replay: the debiased dataset just generated by following this README

The user should use the following command:

python3 count_analysis.py config.datasets.ini \
    --population AZ20_30k \
    --datasets Drebin Drebin_deb-01{,-replay} \
    --append-population-name --filename "Drebin_debias_replay"

As shown in the output of the script, Drebin_deb-01 and the replay (Drebin_deb-01-replay) have the same Chi2 value, which is expected. The file out/count_analysis_output/Drebin_debias_replay_AZ20_30k.xlsx contains a table with information about Drebin and the provided debiased dataset and the new generated one:

[Figure: count analysis result]

Mix dataset debiasing algorithm

For producing mixed datasets, we provide a script that takes two datasets as input: one should contain the malware, the other the goodware.

The command is the following:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
    "id_name_of_the_generated_dataset" "Full name of the generated dataset" \
    id_debiased_malware_dataset \
    id_debiased_goodware_dataset \
    year-time-barrier_training-test \
    --date-fix year_fix.sha256.csv

with the parameters:

id_debiased_malware_dataset: the id of the malware dataset that will be loaded from folder datasets/ and out/.
id_debiased_goodware_dataset: the id of the goodware dataset that will be loaded from folder datasets/ and out/.
year-time-barrier_training-test: an integer representing the year used to split the datasets into the training part and the test part.
(optional) --date-fix year_fix.sha256.csv: a helper file that the user should provide to help identify the date of broken APKs. Indeed, some APKs have a date of 0 when the date is extracted from the APK archive (zip date construction). In this case, the script can open the helper file to search for an alternative date.
(optional) --percent: specify the percent of malware applications for the output test dataset (the default is 5%)
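The role of the year barrier can be illustrated with a short sketch. This is not the code of gen_dataset_by_parts_subset_year_c.py: the function name is hypothetical, and the handling of the barrier year itself (excluded from both sets here) is an assumption that should be checked against the script.

```python
def split_by_year(apps, barrier_year):
    """Split (sha256, year) pairs into a training part and a test part.

    Assumption: apps dated strictly before the barrier go to training,
    apps dated strictly after it go to test.
    """
    training = [app for app in apps if app[1] < barrier_year]
    test = [app for app in apps if app[1] > barrier_year]
    return training, test


# Hypothetical hashes and years.
apps = [("aaaa", 2011), ("bbbb", 2013), ("cccc", 2014)]
training, test = split_by_year(apps, 2013)
print(training)  # [('aaaa', 2011)]
print(test)      # [('cccc', 2014)]
```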

For example, to mix the “debiased Drebin” dataset created above (Drebin_deb-01-replay) with the “debiased NAZE” dataset, using 2013 as the barrier that separates the training set from the test set, the user should run:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
    "DR-AG-C2_deb-01" "DR-AG-C2_deb-01" \
    Drebin_deb-01-replay \
    NAZE-18-G_deb-001 \
    2013

To leave the C2 condition out, add the “no-balance-time-window” option:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
    "DR-AG_deb-01" "DR-AG_deb-01" \
    Drebin_deb-01-replay \
    NAZE-18-G_deb-001 \
    2013 \
    --no-balance-time-window

For specifying 10% of malware, add the “percent” option followed by 10:

python3 gen_dataset_by_parts_subset_year_c.py config.datasets.ini \
    "DR-AG-C2_deb-01" "DR-AG-C2_deb-01" \
    Drebin_deb-01-replay \
    NAZE-18-G_deb-001 \
    2013 \
    --percent 10

This script outputs two folders, one for the training set and one for the test set. The content of each folder is similar to the output of debiasing a single dataset. For example, the mixing of Drebin and Naze generates:

out/DR-AG-C2_deb-01-training
- DR-AG-C2_deb-01-training.characteristics.csv
- DR-AG-C2_deb-01-training.features_specs.classcount.csv
- DR-AG-C2_deb-01-training.features_specs.combination_hashes.json
- DR-AG-C2_deb-01-training.features_specs.dataset_class_info.json
out/DR-AG-C2_deb-01-test-5p
- DR-AG-C2_deb-01-test-5p.characteristics.csv
- DR-AG-C2_deb-01-test-5p.features_specs.classcount.csv
- DR-AG-C2_deb-01-test-5p.features_specs.combination_hashes.json
- DR-AG-C2_deb-01-test-5p.features_specs.dataset_class_info.json

Comparing the intersection of two datasets

To count the number of elements in these replays and in the original datasets, use count_apps_per_date_by_single_dataset.py:

python3 count_apps_per_date_by_single_dataset.py config.datasets.ini \
    --source-datasets list of datasets used for mixing \
    --datasets mixed datasets \
    --date-fix year_fix.sha256.csv

The source datasets are the ones that have been used for producing the mixed datasets. The script helps to control the good balance of applications in the produced mixed datasets.

For example, for analysing the mixed dataset “Drebin debiased” and “NAZE debiased”, the user should do:

python3 count_apps_per_date_by_single_dataset.py config.datasets.ini \
    --source-datasets Drebin_deb-01-replay NAZE-18-G_deb-001 \
    --datasets DR-AG-C2_deb-01-training DR-AG-C2_deb-01-test-5p

The output shows that the test set contains no application from the debiased Drebin or debiased goodware datasets dated before 2013. This is expected, since the test set should only cover years after 2013. The output also shows that the training set is balanced between goodware and malware for each year.

For example, the training set contains the following (malware/goodware balanced) for the last available year:

Total for 2012: 
+--------------------------+--------------------------+---------------------+---------+
|                          |   Drebin_deb-01 (Replay) |   NAZE-18-G_deb-001 |   Total |
|--------------------------+--------------------------+---------------------+---------|
| DR-AG-C2_deb-01-training |                       94 |                  94 |     188 |
+--------------------------+--------------------------+---------------------+---------+

And the test set contains the following (5% malware) for the first available year:

Total for 2014:
+-------------------------+--------------------------+---------------------+---------+
|                         |   Drebin_deb-01 (Replay) |   NAZE-18-G_deb-001 |   Total |
|-------------------------+--------------------------+---------------------+---------|
| DR-AG-C2_deb-01-test-5p |                        2 |                  38 |      40 |
+-------------------------+--------------------------+---------------------+---------+
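The two tables above can be checked against the intended proportions with simple arithmetic. The helper names below are hypothetical, and the rounding rule in malware_needed is an assumption, not necessarily what the script does.

```python
def malware_share(malware, goodware):
    """Fraction of malware in a mixed set."""
    return malware / (malware + goodware)


def malware_needed(goodware, percent):
    """Malware count that gives roughly `percent` % malware next to `goodware` apps."""
    return round(goodware * percent / (100 - percent))


print(malware_share(94, 94))   # 0.5: balanced training set for 2012
print(malware_share(2, 38))    # 0.05: 5% malware in the 2014 test set
print(malware_needed(38, 5))   # 2
```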

Note that, because the hashes in the debiased datasets differ most of the time from one run to the next, the results shown may differ from the ones obtained with a newly generated “Drebin debiased” and “NAZE debiased”. However, using the same datasets as inputs (the ones generated in the previous section) but with a different “id” and “name”, the resulting mixed dataset will have the same number of hashes.

Performing all debiasing experiments

Scripts are provided to reproduce the experiments found in the paper. If you have not already done so, first launch the following script, which creates all the folders needed to continue:

python3 csv_to_combinations_hashes.py config.datasets.original.ini datasets

For reproducing all experiments produced in Table II, the user can launch the following script:

python3 redo_table_II.py

For reproducing all experiments produced in Table III, after reproducing the ones of Table II, the user can do:

bash redo_table_III.sh

Including extra features from FalDroid

For more information about generating additional features using FalDroid, please go to this repository.

After the output ARFF files are generated, they must be transformed into proper .merged_characteristics.csv files to be used for ML experiments. To do this, the script arff_to_csv.py transforms the ARFF files into .graph_characteristics.csv files. Then, the script merge_characteristics_faldroid.sh joins these with the respective .characteristics.csv file to generate a .merged_characteristics.csv file. Files of this last type can be used in ML experiments (see section Machine Learning experiments).
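The join step can be pictured with the following sketch, which is not the content of merge_characteristics_faldroid.sh. It assumes that both tables share a sha256 column (known for .characteristics.csv, assumed for .graph_characteristics.csv); the column graph_feature_1 and the function name are hypothetical.

```python
import csv
import io


def merge_on_sha256(base_file, extra_file):
    """Inner-join two CSV tables on their 'sha256' column."""
    extra_rows = {row["sha256"]: row for row in csv.DictReader(extra_file)}
    merged = []
    for row in csv.DictReader(base_file):
        extra = extra_rows.get(row["sha256"])
        if extra is None:
            continue  # no extra features available for this hash
        combined = dict(row)
        combined.update({k: v for k, v in extra.items() if k != "sha256"})
        merged.append(combined)
    return merged


# Hypothetical excerpts of the two tables.
base = io.StringIO("sha256,NFC\naaaa,0\nbbbb,1\n")
extra = io.StringIO("sha256,graph_feature_1\naaaa,3\n")
merged = merge_on_sha256(base, extra)
print(merged)
```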

Downloading APK datasets

We cannot provide the samples directly in this zip archive, as our institution does not allow us to do so. Nevertheless, we provide scripts to recover them from the sha256.txt files.

Goodware datasets

Goodware datasets can be downloaded from AndroZoo, using the script “get_apk_from_androzoo.py”:

python3 get_apk_from_androzoo.py 
usage: get_apk_from_androzoo.py [-h] api_key_file hash_list_file output_dir

For example for Drebin:

python3 get_apk_from_androzoo.py api_key_androzoo datasets/drebin.sha256.txt tmp
Num hashes: 5304
sha256 to download: a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91
Done
...
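For readers who want to see what such a download involves, here is a minimal sketch using only the Python standard library. It is not the code of get_apk_from_androzoo.py. The endpoint URL and the apikey/sha256 query parameters follow the AndroZoo API as commonly documented, but they are assumptions here and should be verified against the current AndroZoo documentation; the function names and the .apk/.part file naming are hypothetical.

```python
import os
import urllib.parse
import urllib.request

# Assumed AndroZoo download endpoint; verify against the official documentation.
ANDROZOO_DOWNLOAD_URL = "https://androzoo.uni.lu/api/download"


def build_download_url(api_key, sha256):
    """Build the download URL for one APK identified by its SHA-256 hash."""
    query = urllib.parse.urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD_URL}?{query}"


def download_apk(api_key, sha256, output_dir):
    """Download one APK unless it is already present, and return its path."""
    target = os.path.join(output_dir, f"{sha256}.apk")
    if os.path.exists(target):
        return target  # already downloaded, so an interrupted run can resume
    os.makedirs(output_dir, exist_ok=True)
    partial = target + ".part"
    urllib.request.urlretrieve(build_download_url(api_key, sha256), partial)
    os.replace(partial, target)  # only complete downloads get the final name
    return target


print(build_download_url(
    "MY_KEY",
    "a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91",
))
```

Skipping files that already exist is one simple way to make a long download resumable.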

Malware datasets

Malware datasets can be partially found in AndroZoo. Drebin and AMD are available, but all VirusShare datasets should be downloaded from the VirusShare website.

Mixed datasets

The mixed datasets can be fully downloaded from AndroZoo:

Usage: python3 download_mixed_datasets.py api_key_androzoo api_key_virusshare outdir

python3 download_mixed_datasets.py api_key_androzoo api_key_virusshare ./

The script creates the following tree and populates them:

        .
        ├── DR-AG_deb
        │   ├── DR-AG_deb-test_no-bal-time-win
        │   └── DR-AG_deb-training_no-bal-time-win
        ├── DR-AG-C2_deb
        │   ├── DR-AG-C2_deb-test
        │   └── DR-AG-C2_deb-training
        ├── VS-AG_deb
        │   ├── VS-AG_deb-test_no-bal-time-win
        │   └── VS-AG_deb-training_no-bal-time-win
        ├── VS-AG-C2_deb
        │   ├── VS-AG-C2_deb-test_no-bal-time-win
        │   └── VS-AG-C2_deb-training_no-bal-time-win
        ├── VS-AG_deb-04
        │   ├── VS-AG_deb-04-test_no-bal-time-win
        │   └── VS-AG_deb-04-training_no-bal-time-win
        │
        ...

The script can be interrupted and you can relaunch the download.

Machine Learning experiments

For redoing ML experiments, please see the dedicated README.

Dataset Files

dada-dataset-v3.1.zip (Size: 150.01 MB)
