These datasets report measurements from 64 Force Sensing Resistors at multiple input voltages. It was found that the input voltage can be used to trim the sensors' sensitivity and, ultimately, to reduce dispersion. The DMAIC cycle of the Six Sigma methodology was used to reduce process variability. The zip folder contains:

1) a Matlab file for loading the data

2) four .txt files with the experimental data of the Force Sensing Resistors


This dataset was produced as a part of my PhD research on Android malware detection using Multimodal Deep Learning. It contains raw data (DEX grayscale images), static analysis data (Android Intents & Permissions), and dynamic analysis data (system call sequences). For the conference research paper, please refer to




| Field Name | Field Type | Input Domain |
| SHA256 | String | 32 bytes |
| DEX_PIXEL_0, ..., DEX_PIXEL_16383 | Integer | {0, 1, ..., 255} |
| INTENT_0, ..., INTENT_99 | Integer | {0, 1} |
| PERMISSION_0, ..., PERMISSION_99 | Integer | {0, 1} |
| SYSCALL_0, ..., SYSCALL_399 | Integer | {0, 1, ..., 123} |
| CLASS | Integer | {0 = Goodware, 1 = Malware} |

intents = ['android.intent.action.main', 'android.intent.action.boot_completed', 'android.intent.action.view', 'android.intent.action.user_present', 'android.intent.action.package_added', 'android.intent.action.package_removed', 'android.intent.action.phone_state', '', 'android.intent.action.package_replaced', 'android.intent.action.create_shortcut', 'android.intent.action.new_outgoing_call', 'android.intent.action.action_power_connected', 'android.intent.action.action_power_disconnected', 'android.intent.action.quickboot_poweron', 'android.intent.action.send', 'android.intent.action.data_sms_received', 'android.intent.action.media_mounted', 'android.intent.action.download_complete', 'android.intent.action.screen_on', 'android.intent.action.media_button', 'android.intent.action.action_shutdown', 'android.intent.action.media_eject', 'android.intent.action.media_unmounted', 'android.intent.action.sim_state_changed', 'android.intent.action.any_data_state', 'android.intent.action.battery_changed', 'android.intent.action.download_notification_clicked', 'android.intent.action.package_install', 'android.intent.action.media_removed', 'android.intent.action.delete', 'android.intent.action.time_set', 'android.intent.action.service_state', 'android.intent.action.media_checking', 'android.intent.action.sendto', 'android.intent.action.timezone_changed', 'android.intent.action.screen_off', 'android.intent.action.date_changed', 'android.intent.action.pick', 'android.intent.action.package_restarted', 'android.intent.action.send_multiple', 'android.intent.action.my_package_replaced', 'android.intent.action.get_content', 'android.intent.action.notification_add', 'android.intent.action.notification_remove', 'android.intent.action.notification_update', 'android.intent.action.battery_low', 'android.intent.action.respond_via_message', 'android.intent.action.set_wallpaper', 'android.intent.action.edit', 'android.intent.action.battery_okay', 'android.intent.action.airplane_mode', 
'android.intent.action.locale_changed', 'android.intent.action.package_changed', 'android.intent.action.headset_plug', 'android.intent.action.sig_str', 'android.intent.action.action_external_applications_available', 'android.intent.action.action_date_changed', 'android.intent.action.action_time_changed', 'android.intent.action.action_media_eject', 'android.intent.action.action_package_added', 'android.intent.action.action_timezone_changed', 'android.intent.action.time_tick', 'android.intent.action.action_view_downloads', 'android.intent.action.close_system_dialogs', 'android.intent.action.web_search', 'android.intent.action.chinamobile_oms_game', 'android.intent.action.reboot', 'android.intent.action.dial', 'android.intent.action.media_scanner_finished', 'android.intent.action.action_package_changed', 'android.intent.action.package_data_cleared', 'android.intent.action.media_search', 'android.intent.action.assist', '', 'android.intent.action.call_button', 'android.intent.action.wallpaper_changed', 'android.intent.action.quickboot_poweroff', 'android.intent.action.close_system_alarm', 'android.intent.action.insert', 'android.intent.action.media_bad_removal', 'android.intent.action.search_long_press', 'android.intent.action.default', 'android.intent.action.music_player', 'android.intent.action.ums_connected', 'android.intent.action.external_applications_available', 'android.intent.action.media_shared', 'android.intent.action.call_privileged', '', 'android.intent.action.camsnap', 'android.intent.action.device_storage_low', 'android.intent.action.manage_network_usage', 'android.intent.action.videocap', 'android.intent.action.camera_button', 'android.intent.action.package_fully_removed', 'android.intent.action.proxy_change', 'android.intent.action.plug_in_airing', 'android.intent.action.set_alarm', 'android.intent.action.device_storage_ok', 'android.intent.action.media_scanner_started', 'android.intent.action.ringtone_picker']

permissions = ['android.permission.internet', 'android.permission.access_network_state', 'android.permission.write_external_storage', 'android.permission.read_phone_state', 'android.permission.access_wifi_state', 'android.permission.wake_lock', 'android.permission.access_coarse_location', 'android.permission.vibrate', 'android.permission.access_fine_location', 'android.permission.receive_boot_completed', 'android.permission.get_tasks', 'android.permission.get_accounts', 'android.permission.system_alert_window', 'android.permission.read_external_storage', 'android.permission.change_wifi_state', 'android.permission.send_sms', '', 'android.permission.write_settings', 'android.permission.mount_unmount_filesystems', 'android.permission.receive_sms', 'android.permission.call_phone', 'android.permission.read_sms', 'android.permission.read_contacts', 'android.permission.record_audio', 'android.permission.read_logs', 'android.permission.change_network_state', 'android.permission.restart_packages', 'android.permission.disable_keyguard', 'android.permission.modify_audio_settings', 'android.permission.write_sms', 'android.permission.access_location_extra_commands', 'android.permission.bluetooth', 'android.permission.use_credentials', 'android.permission.set_wallpaper', 'android.permission.flashlight', 'android.permission.broadcast_sticky', 'android.permission.write_contacts', 'android.permission.process_outgoing_calls', 'android.permission.kill_background_processes', 'android.permission.bluetooth_admin', 'android.permission.manage_accounts', 'android.permission.receive_user_present', 'android.permission.change_configuration', 'android.permission.install_packages', 'android.permission.access_mock_location', 'android.permission.download_without_notification', 'android.permission.write_apn_settings', 'android.permission.read_call_log', 'android.permission.receive_mms', 'android.permission.access_gps', 'android.permission.read_calendar', 
'android.permission.access_download_manager', 'android.permission.authenticate_accounts', 'android.permission.baidu_location_service', 'android.permission.write_calendar', 'android.permission.system_overlay_window', 'android.permission.battery_stats', 'android.permission.delete_packages', 'android.permission.modify_phone_state', 'android.permission.get_package_size', 'android.permission.clear_app_cache', 'android.permission.receive_wap_push', 'android.permission.write_call_log', 'android.permission.write_secure_settings', 'android.permission.access_coarse_updates', 'android.permission.record_video', 'android.permission.interact_across_users_full', 'android.permission.read_settings', 'android.permission.read_profile', 'android.permission.set_wallpaper_hints', 'android.permission.expand_status_bar', 'android.permission.call_privileged', 'android.permission.change_component_enabled_state', 'android.permission.device_power', 'android.permission.write_sync_settings', 'android.permission.reorder_tasks', 'android.permission.read_sync_settings', 'android.permission.nfc', 'android.permission.change_wifi_multicast_state', 'android.permission.write_owner_data', 'android.permission.set_debug_app', 'android.permission.broadcast_sms', 'android.permission.package_usage_stats', 'android.permission.write_internal_storage', 'android.permission.broadcast_package_added', 'android.permission.broadcast_package_replaced', 'android.permission.broadcast_package_install', 'android.permission.access_location', 'android.permission.broadcast_package_changed', 'android.permission.access_mtk_mmhw', 'android.permission.read_owner_data', 'android.permission.manage_documents', 'android.permission.access_superuser', 'android.permission.write_media_storage', 'android.permission.update_device_stats', 'android.permission.access_assisted_gps', 'android.permission.read_sync_stats', 'android.permission.raised_thread_priority', 'android.permission.persistent_activity', 
'android.permission.mout_unmount_filesystems']

syscalls = ['UNK', 'accept', 'access', 'bind', 'brk', 'cacheflush', 'capset', 'chdir', 'chmod', 'clock_gettime', 'clone', 'close', 'connect', 'dup', 'dup2', 'epoll_create', 'epoll_ctl', 'epoll_wait', 'execve', 'exit', 'exit_group', 'fchmod', 'fchown32', 'fcntl', 'fcntl64', 'fdatasync', 'fgetxattr', 'flock', 'fork', 'fsetxattr', 'fstat64', 'fsync', 'ftruncate', 'ftruncate64', 'futex', 'getcwd', 'getdents64', 'getegid32', 'geteuid32', 'getgid32', 'getgroups32', 'getpgid', 'getpid', 'getppid', 'getpriority', 'getresgid32', 'getresuid32', 'getrlimit', 'getsockname', 'getsockopt', 'gettid', 'gettimeofday', 'getuid32', 'inotify_add_watch', 'inotify_init', 'inotify_rm_watch', 'ioctl', 'kill', 'listen', 'lseek', 'lstat64', 'madvise', 'mkdir', 'mmap2', 'mprotect', 'mremap', 'msync', 'munmap', 'nanosleep', 'open', 'pciconfig_iobase', 'personality', 'pipe', 'poll', 'prctl', 'pread', 'ptrace', 'pwrite', 'read', 'readlink', 'recvfrom', 'recvmsg', 'rename', 'restart_syscall', 'rmdir', 'rt_sigreturn', 'rt_sigtimedwait', 'sched_getparam', 'sched_getscheduler', 'sched_yield', 'select', 'sendmsg', 'sendto', 'set_tls', 'setgid32', 'setgroups32', 'setitimer', 'setpgid', 'setpriority', 'setresuid32', 'setrlimit', 'setsid', 'setsockopt', 'setuid32', 'shutdown', 'sigaction', 'sigprocmask', 'sigreturn', 'socket', 'socketpair', 'stat64', 'statfs', 'statfs64', 'tgkill', 'timerfd', 'timerfd_settime', 'umask', 'uname', 'unlink', 'utimes', 'vfork', 'wait4', 'write', 'writev']
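
As a minimal sketch of how the field table and the three lookup lists above fit together, the following Python snippet splits one record into its modalities and decodes the entries. It assumes each record is available as a mapping from the column names of the table to integer values. The 128 x 128 image shape (16384 = 128 x 128), the row-major pixel order and all function names are assumptions made for illustration; the dataset description does not specify them.

```python
def split_record(record):
    """Split one flat record (column name -> int) into its modalities."""
    pixels = [record[f"DEX_PIXEL_{i}"] for i in range(16384)]
    # Assumption: the 16384 pixels form a 128x128 row-major grayscale image.
    image = [pixels[r * 128:(r + 1) * 128] for r in range(128)]
    intent_flags = [record[f"INTENT_{i}"] for i in range(100)]
    permission_flags = [record[f"PERMISSION_{i}"] for i in range(100)]
    syscall_ids = [record[f"SYSCALL_{i}"] for i in range(400)]
    return image, intent_flags, permission_flags, syscall_ids, record["CLASS"]


def active_names(flags, names):
    """Return the names whose binary flag is set to 1."""
    return [name for flag, name in zip(flags, names) if flag == 1]


def decode_syscalls(ids, names):
    """Translate integer system call IDs into names via the lookup list."""
    return [names[i] for i in ids]
```

With the lists above, `active_names(intent_flags, intents)` would give the intents declared by an application, and `decode_syscalls(syscall_ids, syscalls)` the readable system call sequence.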


We would like to thank Universidade Nove de Julho and the Coordination for the Improvement of Higher Education Personnel (CAPES) for supporting this research.


The Dada dataset is associated with the paper “Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?”. The goal of this dataset is to provide a new, updated dataset of goodware/malware applications that other researchers can use in their experiments, for example with detection or classification algorithms. The dataset contains the application hashes and some characteristics.


Quick-start for using the output datasets for your own experiment

If you just want to use the mixed datasets (goodware/malware) for your experiments, you should do:

python3 api_key_androzoo ./

with api_key_androzoo being your API key file provided by the team administrating Androzoo. This script downloads applications from AndroZoo, according to the result of debiasing Drebin/VirusShare mixed with Naze. This result is cached for you.

Two datasets are provided:

  • DN: a debiased version of Drebin mixed with goodware from Androzoo (called Naze)
  • VSN: a debiased version of VirusShare mixed with goodware from Androzoo (called Naze)

├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training

More information about how these datasets have been constructed is given in the paper and this README.


We provide each dataset as a list of hashes in a file, together with some additional information, such as whether an APK is malware or not for mixed datasets. As the primary intent of this work is to debias datasets, we do not need (nor provide) the APKs themselves. Nevertheless, one can recover all these datasets’ content with helper scripts, as explained at the end of this document.

All dataset information is located in the files of the datasets/ folder.

File structure

  • file.sha256.txt: hashes of the applications of the dataset
  • file.characteristics.csv: the characteristics for each SHA-256 hash
  • file.goodmal.csv: the information about the class (goodware=0 or malware=1) when the dataset is mixed. This file is optional when the dataset is a full goodware or malware file.

The header of the characteristics.csv file is:

sha256,date,year,APK size,Personal information,Leak information,Phone integrity,Denial of service,Intrusion
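
The header above can be read with Python's standard csv module. The following is a minimal sketch, not part of the provided scripts: the function name is hypothetical and the sample row is made up (including its date format) purely to make the snippet runnable.

```python
import csv
import io

HEADER = ("sha256,date,year,APK size,Personal information,Leak information,"
          "Phone integrity,Denial of service,Intrusion")


def read_characteristics(handle):
    """Yield one dict per APK, keyed by the column names of the header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != HEADER.split(","):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        yield row


# Made-up sample row, for illustration only (not taken from the dataset).
sample = io.StringIO(HEADER + "\n" + "ab" * 32 + ",2012-05-01,2012,1024,1,0,0,0,1\n")
rows = list(read_characteristics(sample))
print(rows[0]["year"], rows[0]["Intrusion"])
```

For a real file, the same function could be called on a handle opened with `open("datasets/drebin.characteristics.csv", newline="")`, assuming the file follows the naming scheme described above.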

Malware datasets:

The datasets of the paper correspond to the files:

  • drebin: drebin
  • AMD: amd
  • VS 2015: virusshare-2015
  • VS 2016: virusshare-2016
  • VS 2017: virusshare-2017
  • VS 2018: virusshare-2018

Androzoo extracts:

The datasets of the paper correspond to the files:

  • AZ19_100k: androzoo-100k
  • AZ19_100k 2015: androzoo-100k-2015
  • AZ19_100k 2016: androzoo-100k-2016
  • AZ19_100k 2017: androzoo-100k-2017
  • AZ19_100k 2018: androzoo-100k-2018
  • AZ20 10k: androzoo-10k-2020
  • AZ20 20k: androzoo-20k-2020
  • AZ20 30k: androzoo-30k-2020

Note that a few applications have been removed from these extracts as analysis tools like apktool fail to analyze these apps.

Goodware datasets:

The datasets of the paper correspond to the files:

  • NAZE-18-G: goodware-2018
  • NAZE_Debiased-18-G: debias-goodware-2018-to-30k-0025

Debiased malware datasets:

The datasets of the paper correspond to the files:

  • Drebin_Debiased: debias-drebin-to-30k-0025
  • VS_Debiased-15-18: debias-vs15-18-to-30k-02

For delta in 0.{0025,005, 01, 02, 04}:

  • VS_Debiased-15: debias-vs2015-to-az100k-2015-delta
  • VS_Debiased-16: debias-vs2016-to-az100k-2016-delta
  • VS_Debiased-17: debias-vs2017-to-az100k-2017-delta
  • VS_Debiased-18: debias-vs2018-to-az100k-2018-delta

Mixed dataset:

These datasets contain the additional file file.goodmal.csv.


The datasets of the paper correspond to the files:

  • Dmix: mix-drebin-two-third-goodware

Drebin_Debiased + NAZE_Debiased-18-G

These datasets have been built to be directly usable for machine learning algorithms. For downloading them, you can go to the end of this document. Downloading all APKs of these datasets is not required to execute the debiasing algorithms.

The datasets of the paper correspond to the files:

Training sets:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win
  • DN50: drebin_debiased-naze_debiased-training

Test sets:

  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win
  • DN5: drebin_debiased-naze_debiased-test-5p
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
  • DN10: drebin_debiased-naze_debiased-test-10.0p

Goodware/Malware information:

  • DN50-NoC2: drebin_debiased-naze_debiased-training_no-bal-time-win.goodmal.csv
  • DN50: drebin_debiased-naze_debiased-training.goodmal.csv
  • DN5-NoC2: drebin_debiased-naze_debiased-test-5p_no-bal-time-win.goodmal.csv
  • DN5: drebin_debiased-naze_debiased-test-5p.goodmal.csv
  • DN10-NoC2: drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win.goodmal.csv
  • DN10: drebin_debiased-naze_debiased-test-10.0p.goodmal.csv

VS_Debiased-15-18 + NAZE_Debiased-18-G

Training sets:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win
  • VSN50: vs15-18_debiased-naze_debiased-2017-training

Test sets:

  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p

Goodware/Malware information:

  • VSN50-NoC2: vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win.goodmal.csv
  • VSN50: vs15-18_debiased-naze_debiased-2017-training.goodmal.csv
  • VSN5-NoC2: vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win.goodmal.csv
  • VSN5: vs15-18_debiased-naze_debiased-2017-test-5p.goodmal.csv
  • VSN10-NoC2: vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win.goodmal.csv
  • VSN10: vs15-18_debiased-naze_debiased-2017-test-10.0p.goodmal.csv

Usage of debiasing and evaluation scripts

To replay the dataset debiasing process (or try it with a new arrangement of datasets), scripts are provided for this purpose. For debiasing, we need to generate the possible classes. Classes are defined by all the observed combinations of characteristics.


First, install the packages using the requirements.txt.

pip3 install -r requirements.txt

Generation of classes

To run the debiasing algorithm, several subfolders are required. They hold new files that list the classes (combinations of characteristics) and the APKs that belong to each class. These subfolders are generated in “input_datasets”.

To generate these subfolders, use the following command:

python3 config.datasets.ini datasets

For example, for drebin, the script generates:


  • the characteristics.csv file is a copy of the original file located in the datasets/ folder
  • the classcount.csv file indicates the number of APKs that match a combination of characteristics (a class). For example, for Drebin, the class “1,1,1,0,0,0” contains 543 applications.
  • the combination_hashes.json file contains a dictionary that associates each combination of characteristics (a class) with the list of SHA-256 hashes of the matching APK files

This information can be used later more efficiently when debiasing datasets.
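
The relation between the class counts and the hash dictionary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the choice of class columns and the sample rows are hypothetical, and the provided script may derive classes differently (for instance, the example class “1,1,1,0,0,0” has six components).

```python
from collections import defaultdict


def group_by_class(rows, class_columns):
    """Map each combination of characteristic values to the hashes showing it."""
    combination_hashes = defaultdict(list)
    for row in rows:
        key = ",".join(str(row[column]) for column in class_columns)
        combination_hashes[key].append(row["sha256"])
    classcount = {key: len(hashes) for key, hashes in combination_hashes.items()}
    return classcount, dict(combination_hashes)


# Made-up rows with two characteristics, for illustration only.
rows = [
    {"sha256": "aa", "Personal information": "1", "Intrusion": "0"},
    {"sha256": "bb", "Personal information": "1", "Intrusion": "0"},
    {"sha256": "cc", "Personal information": "0", "Intrusion": "1"},
]
classcount, combination_hashes = group_by_class(
    rows, ["Personal information", "Intrusion"]
)
print(classcount)
```

Keeping the hashes grouped per class means a debiasing step can pick or drop APKs of a given class without rescanning the characteristics file.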

Single dataset debiasing algorithm

For debiasing one base dataset (or a union of base datasets) with a target dataset and a list of source datasets, the user should refer to each dataset by the basename of its files in the datasets/ folder. The command line looks like the following:

python3 config.datasets.ini \
new_dataset_name \
"Name of the new dataset" \
--base-datasets base_dataset_list \
--target-dataset target_dataset_name \
--source-datasets list \
--delta value

where delta is a value in [0,1] that controls the distance between the output and the target dataset.

Four outputs are expected in out/new_dataset_name and are similar to the outputs of the generation of classes:

  • the file new_dataset_name.characteristics.csv: the characteristics.
  • the file new_dataset_name.features_specs.classcount.csv: the number of APKs for each combination of characteristics (class).
  • the file new_dataset_name.features_specs.combination_hashes.json: a dictionary that associates each class with the SHA-256 hashes of its APKs.
  • the file new_dataset_name.features_specs.dataset_class_info.json: some general information about the experiment:
    • size: size of the dataset
    • modified: true if this dataset has been generated
    • base dataset and original size
    • target dataset
    • source dataset list
    • delta
    • number of combinations (classes)
    • combination not found: the classes that are empty, i.e. for which no APK could be found
    • debiasable: false if the debiasing algorithm fails
    • added: the number of APKs added from the source datasets to this new dataset
    • removed: the number of APKs removed from the base dataset
    • d_min final: the d_min value at the end of the algorithm given in the paper
    • add ratio: the ratio of added APKs to the size of the generated dataset, between 0 and 1
    • run time: the duration of the debiasing algorithm

For example, for the Drebin dataset as input, with androzoo-30k-2020 as target dataset, with amd and virusshare-201{5,6,7,8} as source datasets, and for a delta of 0.04, the user should launch:

python3 config.datasets.ini \
debias-drebin-to-30k-04-replay \
"Replay (0.04) Debiased Drebin --> AndroZoo 30k (2020)" \
--base-datasets drebin \
--target-dataset androzoo-30k-2020 \
--source-datasets amd virusshare-201{5,6,7,8} \
--delta "0.04"

We call this experiment “Replay” because the user replays the debiasing algorithm and should obtain results similar to the already provided dataset datasets/debias-drebin-to-30k-04.

Replaying this experiment generates in the folder out/debias-drebin-to-30k-04-replay/ the files:


In particular, in the class_info file, we note that:

  • 886 apps have been added
  • 2421 apps have been removed
  • the add ratio is 23.5%
  • the final dataset size is 3769
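
These figures are consistent with each other, as the short check below illustrates. It assumes the base size of 5304 hashes that the download example reports for datasets/drebin.sha256.txt; the variable names are chosen for illustration.

```python
base_size = 5304  # number of hashes reported for datasets/drebin.sha256.txt
added = 886       # APKs added from the source datasets
removed = 2421    # APKs removed from the base dataset

# The generated dataset keeps the base APKs that were not removed,
# plus the ones taken from the source datasets.
final_size = base_size - removed + added

# The add ratio relates the additions to the size of the generated dataset.
add_ratio = added / final_size

print(final_size)          # 3769
print(f"{add_ratio:.1%}")  # 23.5%
```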

Even though the algorithm will generate a different dataset each time, the number of elements per class is the same in every re-run with the same base, target and source datasets, and delta. To verify this, the user can check the difference of the classcount.csv files between the original and the replay:

diff <(sort input_datasets/debias-drebin-to-30k-04/debias-drebin-to-30k-04.features_specs.classcount.csv) \
<(sort out/debias-drebin-to-30k-04-replay/debias-drebin-to-30k-04-replay.features_specs.classcount.csv)
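
Where process substitution is not available, the same order-insensitive comparison can be sketched in Python. The function name is hypothetical, and the sample lines are made up, since the exact layout of classcount.csv is not described here; the comparison itself does not depend on that layout.

```python
from collections import Counter


def same_class_counts(lines_a, lines_b):
    """Return True if both iterables hold the same lines, ignoring order."""
    def normalise(lines):
        return Counter(line.rstrip("\r\n") for line in lines)
    return normalise(lines_a) == normalise(lines_b)


# Made-up lines in a made-up layout, for illustration only.
original = ["1,1,1,0,0,0,543\n", "0,0,0,0,0,0,12\n"]
replay = ["0,0,0,0,0,0,12\n", "1,1,1,0,0,0,543\n"]
print(same_class_counts(original, replay))  # True
```

Open file handles can be passed directly, for example the two classcount.csv files named in the diff command.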

If the debiasing algorithm fails, several solutions can be tested:

  • increase the delta value to let the output be farther from the target
  • provide more samples in the source datasets (probably some classes do not have enough applications)

Comparing datasets with a population

When new datasets are generated, or with the input datasets, the user may want to evaluate the distance between these datasets and an extract of the population. In particular, we provide a script to evaluate the Chi2 and the p-value, using the following command:

python3 config.datasets.ini \
--population population_dataset \
--datasets list of datasets to evaluate \
--append-population-name \
--filename "filename"

The parameters are:

  • population: indicates the name of the dataset to use as an extract of the considered population
  • datasets: a list of dataset names that can be located in either the datasets/ or the out/ folder
  • append-population-name option: adds the name of the population to the output file name
  • filename: the name of the output file

The outputs are:

  • out/count_analysis_output/filename_population_dataset.xlsx: a table containing the comparison of the considered datasets (Chi2, p-value, added/removed app count, etc.)
  • out/count_analysis_output/filename_population_dataset.tex: a LaTeX tabular containing the max delta and the size of the datasets.

These outputs, in particular the latex output, can be customized to your needs.
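
To give an idea of what such a comparison measures, here is a minimal sketch of Pearson's chi-squared statistic computed from per-class counts. It is not the provided script: the function name is hypothetical, the exact statistic used by the script is not described here, and skipping classes that are absent from the population is an assumption of this sketch.

```python
def chi2_statistic(dataset_counts, population_counts):
    """Pearson chi-squared statistic of a dataset against a population extract.

    Both arguments map a class (combination of characteristics) to a count.
    Classes with no APK in the population have no expected frequency and
    are skipped (an assumption of this sketch).
    """
    dataset_size = sum(dataset_counts.values())
    population_size = sum(population_counts.values())
    chi2 = 0.0
    for cls, population_count in population_counts.items():
        if population_count == 0:
            continue
        # Count expected in the dataset if it followed the population.
        expected = dataset_size * population_count / population_size
        observed = dataset_counts.get(cls, 0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2


# Made-up counts, for illustration only.
population = {"1,0": 50, "0,1": 50}
dataset = {"1,0": 30, "0,1": 10}
print(chi2_statistic(dataset, population))  # 10.0
```

A dataset with the same class proportions as the population yields 0, and larger values indicate a larger distance. If SciPy is available, a p-value could be obtained from this statistic with `scipy.stats.chi2.sf(statistic, degrees_of_freedom)`, typically with the number of classes minus one as degrees of freedom.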

For example, for comparing the following three datasets with the extract of AndroZoo of size 30k extracted in 2020 (androzoo-30k-2020):

  • drebin: the original Drebin dataset
  • debias-drebin-to-30k-04: the debiased dataset already computed and dropped in the datasets/ folder
  • debias-drebin-to-30k-04-replay: the debiased dataset just generated by following this README

The user should use the following command:

python3 config.datasets.ini \
--population androzoo-30k-2020 \
--datasets drebin debias-drebin-to-30k-04{,-replay} \
--append-population-name --filename "drebin_debias_replay"

As shown in the output of the script, debias-drebin-to-30k-04 and the replay (debias-drebin-to-30k-04-replay) have the same Chi2 value, which is expected. The file out/count_analysis_output/drebin_debias_replay_androzoo-30k-2020.xlsx contains a table with information about Drebin, the provided debiased dataset and the newly generated one:

Count analysis result

Mix dataset debiasing algorithm

For producing mixed datasets, we provide a script that takes two datasets as input: one should contain the malware, the other the goodware.

The command is the following:

python3 config.datasets.ini \
"id_name_of_the_generated_dataset" "Full name of the generated dataset" \
id_debiased_malware_dataset \
id_debiased_goodware_dataset \
year-time-barrier_training-test \
--date-fix sha256.dex_date.vt_date.txt

with the parameters:

  • id_debiased_malware_dataset: the id of the malware dataset that will be loaded from folder datasets/ and out/.
  • id_debiased_goodware_dataset: the id of the goodware dataset that will be loaded from folder datasets/ and out/.
  • year-time-barrier_training-test: an integer representing the year used to split the datasets into the training part and the test part.
  • date-fix sha256.dex_date.vt_date.txt: a helper file that the user should provide to help identify the date of broken APKs. Indeed, some APKs have a date of 0 when the date is extracted from the APK archive (zip date construction). In this case, the script can open the helper file to search for an alternative date.
  • (optional) percent: specify the percent of malware applications for the output test dataset (the default is 5%)
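
The role of the year barrier can be sketched as follows. This is a simplified illustration, not the provided script: the function name and the sample rows are hypothetical, and the convention that the barrier year itself belongs to the training part is read off the 2013 example in this README. The balancing of goodware and malware and the malware percentage of the test set are not modelled here.

```python
def split_by_year(rows, barrier_year):
    """Split rows into a training part (up to the barrier) and a test part (after it)."""
    training = [row for row in rows if int(row["year"]) <= barrier_year]
    test = [row for row in rows if int(row["year"]) > barrier_year]
    return training, test


# Made-up rows, for illustration only.
rows = [
    {"sha256": "aa", "year": "2012"},
    {"sha256": "bb", "year": "2013"},
    {"sha256": "cc", "year": "2014"},
]
training, test = split_by_year(rows, 2013)
print([row["sha256"] for row in training], [row["sha256"] for row in test])
```

Splitting on time in this way keeps applications newer than the barrier out of the training part.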

For example, for mixing the “debiased Drebin” dataset just created before (drebin_debiased-naze_debiased-replay) and the “debiased NAZE” dataset, and for using 2013 as a barrier for delimiting the training set and the test set, the user should run:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

To leave the C2 condition out, add the “no-balance-time-window” option:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt \
--no-balance-time-window

For specifying 10% of malware, add the “percent” option followed by 10:

python3 config.datasets.ini \
"drebin_debiased-naze_debiased-replay" "Drebin Debiased replay + NAZE Debiased (Replay)" \
debias-drebin-to-30k-04-replay \
debias-goodware-2018-to-30k-0025 \
2013 \
--percent 10 \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

This script outputs two folders, one for the training set and one for the test set. The content of each folder is similar to the output of debiasing a single dataset. For example, the mixing of Drebin and Naze generates:

  • out/drebin_debiased-naze_debiased-replay-training
    • drebin_debiased-naze_debiased-replay-training.characteristics.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-training.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-training.features_specs.dataset_class_info.json
  • out/drebin_debiased-naze_debiased-replay-test-5p
    • drebin_debiased-naze_debiased-replay-test-5p.characteristics.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.classcount.csv
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.combination_hashes.json
    • drebin_debiased-naze_debiased-replay-test-5p.features_specs.dataset_class_info.json

Comparing the intersection of two datasets

To count the number of elements in these replays and in the original datasets, the following command can be used:

python3 config.datasets.ini \
--source-datasets list of datasets used for mixing \
--datasets mixed datasets \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The source datasets are the ones that have been used for producing the mixed datasets. The script helps to control the good balance of applications in the produced mixed datasets.

For example, for analysing the mixed dataset “debiased Drebin” and “debiased NAZE”, the user should do:

python3 config.datasets.ini \
--source-datasets debias-drebin-to-30k-04-replay debias-goodware-2018-to-30k-0025 \
--datasets drebin_debiased-naze_debiased-replay-training drebin_debiased-naze_debiased-replay-test-5p \
--date-fix naze_debiased-sources.sha256.dex_date.vt_date.txt

The output shows that the test set does not contain any application from debias-drebin or debias-goodware dated 2013 or earlier. Indeed, the test set should only contain years greater than 2013. The output also shows that the training set is balanced between goodware and malware for each year.

For example, the training set contains the following (malware/goodware balanced) for the last available year:

Total for 2013:
| | Replay (0.04) Debiased Dr | (0.0025) Debiased AndroZo |
| | ebin --> AndroZoo 30k (20 | o Goodware (2018) --> And |
| | 20) | roZoo 30k (2020) |
| Drebin Debiased replay + NAZE Debiased (Replay)-training | 81 | 81 |

And the test set contains the following (5% malware) for the first available year:

Total for 2014:
| | Replay (0.04) Debiased Dr | (0.0025) Debiased AndroZo |
| | ebin --> AndroZoo 30k (20 | o Goodware (2018) --> And |
| | 20) | roZoo 30k (2020) |
| Drebin Debiased replay + NAZE Debiased (Replay)-test-5p | 8 | 169 |

Note that, because the hashes in the debiased datasets are different most of the time, the results shown may differ from the ones obtained with a new “debiased Drebin” and “debiased NAZE”. However, when using the same datasets as inputs (the ones generated in the previous section) with a different “id” and “name”, the resulting mixed dataset will have the same number of hashes.

Performing all debiasing experiments

For reproducing all experiments produced in Table II, the user can launch the following script:


For reproducing all experiments produced in Table III, after reproducing the ones of Table II, the user can do:


Downloading APK datasets

We cannot provide the samples directly in this zip archive, as our institution does not allow us to do so. Nevertheless, we provide scripts to recover them from the sha256.txt files.

Goodware datasets

Goodware datasets can be downloaded from AndroZoo, using the script “”:

usage: [-h] api_key_file hash_list_file output_dir

For example for Drebin:

python3 api_key_androzoo datasets/drebin.sha256.txt tmp
Num hashes: 5304
sha256 to download: a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91

Malware datasets

Malware datasets can be partially found in AndroZoo. Drebin and AMD are available, but all VirusShare datasets should be downloaded from the VirusShare website.

Mixed datasets

The mixed datasets can be fully downloaded from AndroZoo:

Usage: python3 api_key_androzoo outdir

python3 api_key_androzoo ./

The script creates the following tree and populates it:

├── DN
│   ├── drebin_debiased-naze_debiased-test-10.0p
│   ├── drebin_debiased-naze_debiased-test-5p
│   └── drebin_debiased-naze_debiased-training
├── DN-NoC2
│   ├── drebin_debiased-naze_debiased-test-10.0p_no-bal-time-win
│   ├── drebin_debiased-naze_debiased-test-5p_no-bal-time-win
│   └── drebin_debiased-naze_debiased-training_no-bal-time-win
├── VSN
│   ├── vs15-18_debiased-naze_debiased-2017-test-10.0p
│   ├── vs15-18_debiased-naze_debiased-2017-test-5p
│   └── vs15-18_debiased-naze_debiased-2017-training
└── VSN-NoC2
    ├── vs15-18_debiased-naze_debiased-2017-test-10.0p_no-bal-time-win
    ├── vs15-18_debiased-naze_debiased-2017-test-5p_no-bal-time-win
    └── vs15-18_debiased-naze_debiased-2017-training_no-bal-time-win

The script can be interrupted and you can relaunch the download.


Simple text file obtained from manually scraping the web for the question "What is Machine Learning?".

Each file contains the first paragraph or page in which a website answers the question. This data is not used for commercial purposes and is available to all.

This data is used in TAES to show how it can be used for plagiarism checking. The text files (*.txt) contain plain text and need no preprocessing to use. Simply read the file and assign the data to a string object. 
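
Reading a file into a string can be done as in the following minimal sketch. The function name and the file name in the usage note are hypothetical, and UTF-8 encoding is an assumption, as the encoding of the files is not stated.

```python
from pathlib import Path


def load_text(path):
    """Read a whole text file into a single string."""
    # Assumption: the files are UTF-8 encoded.
    return Path(path).read_text(encoding="utf-8")
```

For instance, `text = load_text("source_1.txt")` would assign the content of a file named source_1.txt to the string `text`.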


This dataset is taken from 20 subjects over a duration of 1 hour where experiments were done on the upper body bio-impedance with the following objectives:

a)     Evaluate the effect of externally induced perturbance at the SE interface caused by motion, applied pressure, temperature variation and posture change on bio-impedance measurements.

b)     Evaluate the degree of distortion due to artefacts at multiple frequencies (10 kHz to 100 kHz) in the bio-impedance measurements.


The PD-BioStampRC21 dataset provides data from a wearable sensor accelerometry study conducted for studying activity, gait, tremor, and other motor symptoms in individuals with Parkinson's disease (PD). In addition to individuals with PD, the dataset also includes data for controls that went through the same study protocol as the PD participants. Data were acquired using lightweight MC 10 BioStamp RC sensors (MC 10 Inc, Lexington, MA), five of which were attached to each participant for gathering data over a roughly two-day interval.


Users of the dataset should cite the following paper:

Jamie L. Adams, Karthik Dinesh, Christopher W. Snyder, Mulin Xiong, Christopher G. Tarolli, Saloni Sharma, E. Ray Dorsey, Gaurav Sharma, "A real-world study of wearable sensors in Parkinson’s disease". Accepted for publication at npj Parkinson's Disease, 2021, to appear.

An overview of the study protocol is also provided in the above-mentioned paper. Additional detail specific to the dataset and file naming conventions is provided here.

The dataset comprises two main components: (I) sensor and UPDRS-assessment-task annotation data for each participant and (II) demographic and clinical assessment data for all participants. Each of these is described in turn below:

I) Sensor and UPDRS-assessment-task annotation data:

The sensor accelerometry and UPDRS-assessment-task annotation data for all the participants are provided as a zip file. The size of the zip file is 11 GB and, when unzipped, it generates a set of folders and files with a total size of approximately 56 GB. Unzipping the file generates folders with names matching the participant ID for each of the Control and PD participants (17 Control + 17 PD). Each participant folder contains the data organized as the following files.

a) Accelerometer sensor data files (CSV) corresponding to the five different sensor placement locations, which are abbreviated as:

1) Trunk (chest)           - abbreviated as "ch"  

2) Left anterior thigh     - abbreviated as "ll"  

3) Right anterior thigh    - abbreviated as "rl"  

4) Left anterior forearm   - abbreviated as "lh"  

5) Right anterior forearm  - abbreviated as "rh"   

Example file name for accelerometer sensor data files: "AbbreviatedSensorLocation"_ID"ParticipantID"Accel.csv

E.g. ch_ID018Accel.csv, ll_ID018Accel.csv, rl_ID018Accel.csv, lh_ID018Accel.csv, and rh_ID018Accel.csv
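The naming convention above can be sketched as a small Python helper. The function names are illustrative, and passing the participant ID as a zero-padded string such as "018" is an assumption based only on the example file names.

```python
# Location abbreviations as listed in the dataset description.
SENSOR_LOCATIONS = {
    "ch": "Trunk (chest)",
    "ll": "Left anterior thigh",
    "rl": "Right anterior thigh",
    "lh": "Left anterior forearm",
    "rh": "Right anterior forearm",
}


def accel_file_name(location, participant_id):
    """Build e.g. 'ch_ID018Accel.csv' from 'ch' and '018'."""
    if location not in SENSOR_LOCATIONS:
        raise ValueError(f"unknown sensor location: {location!r}")
    return f"{location}_ID{participant_id}Accel.csv"


def participant_accel_files(participant_id):
    """Return the five accelerometer file names for one participant."""
    return [accel_file_name(loc, participant_id) for loc in SENSOR_LOCATIONS]


print(participant_accel_files("018"))
```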

File format for the accelerometer sensor data files: Comprises four columns that provide a timestamp for each measurement and the corresponding triaxial accelerometry relative to the sensor coordinate system.

Column 1: "Timestamp (ms)" - Time in milliseconds  

Column 2: "Accel X (g)"    - Acceleration in X-direction (in units of g = 9.8 m/s^2)

   Column 3: "Accel Y (g)"    - Acceleration in Y-direction (in units of g = 9.8 m/s^2)

   Column 4: "Accel Z (g)"    - Acceleration in Z-direction (in units of g = 9.8 m/s^2)

   Times and timestamps are consistently reported in units of milliseconds starting from the instant of the earliest sensor recording (for the first sensor applied to the participant).
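A minimal Python sketch of reading a file in this four-column format and computing the acceleration magnitude is given here. It assumes the CSV is comma-delimited with a header row carrying exactly the column names listed above; the sample rows are synthetic and are not values from the dataset.

```python
import csv
import io
import math

# Synthetic rows for illustration only; these are not values from the dataset.
SAMPLE_CSV = """Timestamp (ms),Accel X (g),Accel Y (g),Accel Z (g)
0,0.0,0.0,1.0
32,0.6,0.0,0.8
64,0.0,-1.0,0.0
"""


def read_accel(handle):
    """Parse accelerometer rows into (timestamp_ms, x, y, z) tuples."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((
            float(row["Timestamp (ms)"]),
            float(row["Accel X (g)"]),
            float(row["Accel Y (g)"]),
            float(row["Accel Z (g)"]),
        ))
    return rows


def magnitude_g(x, y, z):
    """Euclidean norm of the triaxial acceleration, in units of g."""
    return math.sqrt(x * x + y * y + z * z)


samples = read_accel(io.StringIO(SAMPLE_CSV))
for t_ms, x, y, z in samples:
    # Convert milliseconds to seconds for display.
    print(f"t = {t_ms / 1000.0:.3f} s, |a| = {magnitude_g(x, y, z):.3f} g")
```

For a real file, the same function can be called on an open file handle, e.g. `with open("ch_ID018Accel.csv", newline="") as f: samples = read_accel(f)`.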

b) Annotation file (CSV). This file provides tagging annotations for the sensor data that identify, via start and end timestamps, the durations of various clinical assessments performed in the study.

   Example file name for annotation file: AnnotID"ParticipantID".csv

   E.g. AnnotID018.csv

   File format for the annotation file: Comprises four columns

   Column 1: "Event Type"           - List of in-clinic MDS-UPDRS assessments. Each assessment comprises two queries - medication status and MDS-UPDRS assessment body locations

   Column 2: "Start Timestamp (ms)" - Start timestamp for the MDS-UPDRS assessments

   Column 3: "Stop Timestamp (ms)"  - Stop timestamp for the MDS-UPDRS assessments

   Column 4: "Value"                - Responses to the queries in Column 1 - medication status (OFF/ON) and MDS-UPDRS assessment body locations (E.g. RIGHT HAND, NECK, etc.)
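Because the annotation timestamps share the sensor time base, an annotation row can be used to cut out the matching stretch of accelerometer samples. The Python sketch below illustrates this under stated assumptions: a comma-delimited file with a header row using the column names above, and inclusive start and stop boundaries. The event names, times, and samples are synthetic.

```python
import csv
import io

# Synthetic annotation rows for illustration only; names and times are made up.
ANNOT_CSV = """Event Type,Start Timestamp (ms),Stop Timestamp (ms),Value
Hypothetical task A,1000,2000,RIGHT HAND
Hypothetical task B,5000,6000,OFF
"""


def read_annotations(handle):
    """Parse annotation rows into dictionaries with numeric timestamps."""
    events = []
    for row in csv.DictReader(handle):
        events.append({
            "event": row["Event Type"],
            "start_ms": float(row["Start Timestamp (ms)"]),
            "stop_ms": float(row["Stop Timestamp (ms)"]),
            "value": row["Value"],
        })
    return events


def samples_in_event(samples, event):
    """Keep samples whose timestamp lies in [start_ms, stop_ms] (inclusive)."""
    return [s for s in samples if event["start_ms"] <= s[0] <= event["stop_ms"]]


# Synthetic (timestamp_ms, x, y, z) samples, one every 500 ms.
samples = [(t, 0.0, 0.0, 1.0) for t in range(0, 7000, 500)]
events = read_annotations(io.StringIO(ANNOT_CSV))
for ev in events:
    window = samples_in_event(samples, ev)
    print(ev["event"], ev["value"], len(window))
```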

II) Demographic and clinical assessment data

For all participants, the demographic and clinical assessment data are provided as a zip file "". Unzipping the file generates a CSV file named Clinic_Data_PD-BioStampRC21.csv

File format for the demographic and clinical assessment data file: Comprises 19 columns

Column 1: "ID"                                               - Participant ID

Column 2: "Sex"                                              - Participant sex (Male/Female)

Column 3: "Status"                                           - Participant disease status (PD/Control)

Column 4: "Age"                                              - Participant age

Column 5: "updrs_3_17a"                                      - Rest tremor amplitude (RUE - Right Upper Extremity)

Column 6: "updrs_3_17b"                                      - Rest tremor amplitude (LUE - Left Upper Extremity)

Column 7: "updrs_3_17c"                                      - Rest tremor amplitude (RLE - Right Lower Extremity)

Column 8: "updrs_3_17d"                                      - Rest tremor amplitude (LLE - Left Lower Extremity)

Column 9: "updrs_3_17e"                                      - Rest tremor amplitude (Lip/Jaw)

Column 10 - Column 14: "updrs_3_17a_off" - "updrs_3_17e_off" - Rest tremor amplitude during OFF medication assessment (same ordering as Column 5 to Column 9)

Column 15 - Column 19: "updrs_3_17a_on" - "updrs_3_17e_on"   - Rest tremor amplitude during ON medication assessment

Note that columns 10-19 do not contain any data for control participants and for PD participants that did not participate in the ON/OFF medication component of the assessment protocol for the study.
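Since columns 10-19 can be empty, code that reads the clinical file has to tolerate missing scores. The Python sketch below shows one way to do so. It assumes that missing values appear as empty cells (the real file may encode them differently) and that the header uses the column names above; the rows are synthetic and show only a subset of the 19 columns.

```python
import csv
import io

# Synthetic rows; IDs and scores are made up. Only a subset of columns is shown.
CLINIC_CSV = """ID,Sex,Status,Age,updrs_3_17a,updrs_3_17a_off,updrs_3_17a_on
901,Female,Control,60,0,,
902,Male,PD,65,2,3,1
"""

SCORE_COLUMNS = ["updrs_3_17a", "updrs_3_17a_off", "updrs_3_17a_on"]


def parse_score(cell):
    """Return the score as a float, or None when the cell is empty."""
    cell = cell.strip()
    return float(cell) if cell else None


def read_clinic(handle):
    """Parse clinical rows, converting score columns and keeping IDs as text."""
    records = []
    for row in csv.DictReader(handle):
        record = dict(row)
        for col in SCORE_COLUMNS:
            record[col] = parse_score(row[col])
        records.append(record)
    return records


records = read_clinic(io.StringIO(CLINIC_CSV))
# Participants with both OFF and ON scores took part in the ON/OFF component.
on_off_ids = [
    r["ID"] for r in records
    if r["updrs_3_17a_off"] is not None and r["updrs_3_17a_on"] is not None
]
print(on_off_ids)
```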

For details about different MDS-UPDRS assessments and scoring schemes, the reader is referred to:        

Goetz, C. G. et al. Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov Disord 23, 2129-2170, doi:10.1002/mds.22340 (2008)


Amidst the COVID-19 pandemic, cyberbullying has become an even more serious threat. Our work aims to investigate the viability of an automatic multiclass cyberbullying detection model that is able to classify whether a cyberbully is targeting a victim’s age, ethnicity, gender, religion, or other quality. Previous literature has not yet explored making fine-grained cyberbullying classifications of such magnitude, and existing cyberbullying datasets suffer from quite severe class imbalances.


Please cite the following paper when using this open access dataset:

J. Wang, K. Fu, C.T. Lu, “SOSNet: A Graph Convolutional Network Approach to Fine-Grained Cyberbullying Detection,” Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), pp. 1699-1708, December 10-13, 2020.

This is a "Dynamic Query Expansion"-balanced dataset containing .txt files with 8000 tweets for each of the fine-grained cyberbullying classes: age, ethnicity, gender, religion, other, and not cyberbullying.

Total Size: 6.33 MB


Includes some data from:

S. Agrawal and A. Awekar, “Deep learning for detecting cyberbullying across multiple social media platforms,” in European Conference on Information Retrieval. Springer, 2018, pp. 141–153.

U. Bretschneider, T. Wohner, and R. Peters, “Detecting online harassment in social networks,” in ICIS, 2014.

D. Chatzakou, I. Leontiadis, J. Blackburn, E. D. Cristofaro, G. Stringhini, A. Vakali, and N. Kourtellis, “Detecting cyberbullying and cyberaggression in social media,” ACM Transactions on the Web (TWEB), vol. 13, no. 3, pp. 1–51, 2019.

T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” arXiv preprint arXiv:1703.04009, 2017.

Z. Waseem and D. Hovy, “Hateful symbols or hateful people? predictive features for hate speech detection on twitter,” in Proceedings of the NAACL student research workshop, 2016, pp. 88–93.

Z. Waseem, “Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter,” in Proceedings of the first workshop on NLP and computational social science, 2016, pp. 138–142.

J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, “Learning from bullying traces in social media,” in Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2012, pp. 656–666. 


This dataset provides problem sets and results from classical algorithms of the evolutionary computation community.

The following tools were used: Pymoo, Platypus and Pagmo.


This dataset was collected from force, current, angle (magnetic rotary encoder), and inertial sensors of the NAO humanoid robot while walking on Vinyl, Gravel, Wood, Concrete, Artificial grass, and Asphalt without a slope and while walking on Vinyl, Gravel, and Wood with a slope of 2 degrees. In total, counting all axes and components of each sensor, we monitored 27 parameters on board the robot.


Dataset used in the process of learning the operation of the traditional technique. Considering different devices and scenarios, the proposed approach can adapt its response to the device in use by identifying the MAC layer protocol, switching to the protocol in use, and making the device operate with the best possible configuration.