Bangla is one of the most spoken languages in the world, but Bangla NLP research is still at an early stage of development because quality public corpora are lacking. In this article, we describe in detail how we compiled KUMono, a comprehensive monolingual Bangla corpus. This corpus contains more than 353 million word tokens and more than one million unique tokens, drawn from 18 major text categories of online Bangla websites.


Dataset having the curvature and spline values for the roundabout


These datasets report data of 64 Force Sensing Resistors at multiple voltages. It was found that the input voltage can be used to trim the sensors' sensitivity and ultimately to reduce dispersion. The DMAIC cycle was used to reduce process variability on the basis of the Six Sigma Methodology. The zip folder contains:

1) a Matlab file for loading the data

2) four .txt files with the experimental data of the Force Sensing Resistors


This dataset was produced as a part of my PhD research on Android malware detection using Multimodal Deep Learning. It contains raw data (DEX grayscale images), static analysis data (Android Intents & Permissions), and dynamic analysis data (system call sequences). For the conference research paper, please refer to




Field Name                         | Field Type | Input Domain
SHA256                             | String     | 32 bytes
DEX_PIXEL_0, ..., DEX_PIXEL_16383  | Integer    | {0, 1, ..., 255}
INTENT_0, ..., INTENT_99           | Integer    | {0, 1}
PERMISSION_0, ..., PERMISSION_99   | Integer    | {0, 1}
SYSCALL_0, ..., SYSCALL_399        | Integer    | {0, 1, ..., 123}
CLASS                              | Integer    | {0 = Goodware, 1 = Malware}
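
The field layout above can be illustrated with a minimal sketch that splits one record into its modalities. It assumes each record is available as a mapping keyed by the field names in the table. The helper name split_row is hypothetical, and the 128 x 128 image shape is an assumption (16384 = 128 x 128); the table only gives the flat pixel count.

```python
import math

def split_row(row):
    """Split one record (a dict keyed by field name) into its modalities."""
    pixels = [int(row[f"DEX_PIXEL_{i}"]) for i in range(16384)]
    side = math.isqrt(len(pixels))  # 128, assuming a square grayscale image
    image = [pixels[r * side:(r + 1) * side] for r in range(side)]
    return {
        "sha256": row["SHA256"],
        "image": image,                                                # raw data
        "intents": [int(row[f"INTENT_{i}"]) for i in range(100)],       # static analysis
        "permissions": [int(row[f"PERMISSION_{i}"]) for i in range(100)],
        "syscalls": [int(row[f"SYSCALL_{i}"]) for i in range(400)],     # dynamic analysis
        "label": int(row["CLASS"]),                                    # 0 = Goodware, 1 = Malware
    }

# Synthetic record, for illustration only (not real dataset content)
demo = {"SHA256": "0" * 64, "CLASS": "1"}
demo.update({f"DEX_PIXEL_{i}": str(i % 256) for i in range(16384)})
demo.update({f"INTENT_{i}": "0" for i in range(100)})
demo.update({f"PERMISSION_{i}": "1" for i in range(100)})
demo.update({f"SYSCALL_{i}": "0" for i in range(400)})
sample = split_row(demo)
```

Keeping the three modalities separate in the returned dictionary makes it straightforward to feed each one to its own branch of a multimodal model.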

intents = ['android.intent.action.main', 'android.intent.action.boot_completed', 'android.intent.action.view', 'android.intent.action.user_present', 'android.intent.action.package_added', 'android.intent.action.package_removed', 'android.intent.action.phone_state', '', 'android.intent.action.package_replaced', 'android.intent.action.create_shortcut', 'android.intent.action.new_outgoing_call', 'android.intent.action.action_power_connected', 'android.intent.action.action_power_disconnected', 'android.intent.action.quickboot_poweron', 'android.intent.action.send', 'android.intent.action.data_sms_received', 'android.intent.action.media_mounted', 'android.intent.action.download_complete', 'android.intent.action.screen_on', 'android.intent.action.media_button', 'android.intent.action.action_shutdown', 'android.intent.action.media_eject', 'android.intent.action.media_unmounted', 'android.intent.action.sim_state_changed', 'android.intent.action.any_data_state', 'android.intent.action.battery_changed', 'android.intent.action.download_notification_clicked', 'android.intent.action.package_install', 'android.intent.action.media_removed', 'android.intent.action.delete', 'android.intent.action.time_set', 'android.intent.action.service_state', 'android.intent.action.media_checking', 'android.intent.action.sendto', 'android.intent.action.timezone_changed', 'android.intent.action.screen_off', 'android.intent.action.date_changed', 'android.intent.action.pick', 'android.intent.action.package_restarted', 'android.intent.action.send_multiple', 'android.intent.action.my_package_replaced', 'android.intent.action.get_content', 'android.intent.action.notification_add', 'android.intent.action.notification_remove', 'android.intent.action.notification_update', 'android.intent.action.battery_low', 'android.intent.action.respond_via_message', 'android.intent.action.set_wallpaper', 'android.intent.action.edit', 'android.intent.action.battery_okay', 'android.intent.action.airplane_mode', 
'android.intent.action.locale_changed', 'android.intent.action.package_changed', 'android.intent.action.headset_plug', 'android.intent.action.sig_str', 'android.intent.action.action_external_applications_available', 'android.intent.action.action_date_changed', 'android.intent.action.action_time_changed', 'android.intent.action.action_media_eject', 'android.intent.action.action_package_added', 'android.intent.action.action_timezone_changed', 'android.intent.action.time_tick', 'android.intent.action.action_view_downloads', 'android.intent.action.close_system_dialogs', 'android.intent.action.web_search', 'android.intent.action.chinamobile_oms_game', 'android.intent.action.reboot', 'android.intent.action.dial', 'android.intent.action.media_scanner_finished', 'android.intent.action.action_package_changed', 'android.intent.action.package_data_cleared', 'android.intent.action.media_search', 'android.intent.action.assist', '', 'android.intent.action.call_button', 'android.intent.action.wallpaper_changed', 'android.intent.action.quickboot_poweroff', 'android.intent.action.close_system_alarm', 'android.intent.action.insert', 'android.intent.action.media_bad_removal', 'android.intent.action.search_long_press', 'android.intent.action.default', 'android.intent.action.music_player', 'android.intent.action.ums_connected', 'android.intent.action.external_applications_available', 'android.intent.action.media_shared', 'android.intent.action.call_privileged', '', 'android.intent.action.camsnap', 'android.intent.action.device_storage_low', 'android.intent.action.manage_network_usage', 'android.intent.action.videocap', 'android.intent.action.camera_button', 'android.intent.action.package_fully_removed', 'android.intent.action.proxy_change', 'android.intent.action.plug_in_airing', 'android.intent.action.set_alarm', 'android.intent.action.device_storage_ok', 'android.intent.action.media_scanner_started', 'android.intent.action.ringtone_picker']

permissions = ['android.permission.internet', 'android.permission.access_network_state', 'android.permission.write_external_storage', 'android.permission.read_phone_state', 'android.permission.access_wifi_state', 'android.permission.wake_lock', 'android.permission.access_coarse_location', 'android.permission.vibrate', 'android.permission.access_fine_location', 'android.permission.receive_boot_completed', 'android.permission.get_tasks', 'android.permission.get_accounts', 'android.permission.system_alert_window', 'android.permission.read_external_storage', 'android.permission.change_wifi_state', 'android.permission.send_sms', '', 'android.permission.write_settings', 'android.permission.mount_unmount_filesystems', 'android.permission.receive_sms', 'android.permission.call_phone', 'android.permission.read_sms', 'android.permission.read_contacts', 'android.permission.record_audio', 'android.permission.read_logs', 'android.permission.change_network_state', 'android.permission.restart_packages', 'android.permission.disable_keyguard', 'android.permission.modify_audio_settings', 'android.permission.write_sms', 'android.permission.access_location_extra_commands', 'android.permission.bluetooth', 'android.permission.use_credentials', 'android.permission.set_wallpaper', 'android.permission.flashlight', 'android.permission.broadcast_sticky', 'android.permission.write_contacts', 'android.permission.process_outgoing_calls', 'android.permission.kill_background_processes', 'android.permission.bluetooth_admin', 'android.permission.manage_accounts', 'android.permission.receive_user_present', 'android.permission.change_configuration', 'android.permission.install_packages', 'android.permission.access_mock_location', 'android.permission.download_without_notification', 'android.permission.write_apn_settings', 'android.permission.read_call_log', 'android.permission.receive_mms', 'android.permission.access_gps', 'android.permission.read_calendar', 
'android.permission.access_download_manager', 'android.permission.authenticate_accounts', 'android.permission.baidu_location_service', 'android.permission.write_calendar', 'android.permission.system_overlay_window', 'android.permission.battery_stats', 'android.permission.delete_packages', 'android.permission.modify_phone_state', 'android.permission.get_package_size', 'android.permission.clear_app_cache', 'android.permission.receive_wap_push', 'android.permission.write_call_log', 'android.permission.write_secure_settings', 'android.permission.access_coarse_updates', 'android.permission.record_video', 'android.permission.interact_across_users_full', 'android.permission.read_settings', 'android.permission.read_profile', 'android.permission.set_wallpaper_hints', 'android.permission.expand_status_bar', 'android.permission.call_privileged', 'android.permission.change_component_enabled_state', 'android.permission.device_power', 'android.permission.write_sync_settings', 'android.permission.reorder_tasks', 'android.permission.read_sync_settings', 'android.permission.nfc', 'android.permission.change_wifi_multicast_state', 'android.permission.write_owner_data', 'android.permission.set_debug_app', 'android.permission.broadcast_sms', 'android.permission.package_usage_stats', 'android.permission.write_internal_storage', 'android.permission.broadcast_package_added', 'android.permission.broadcast_package_replaced', 'android.permission.broadcast_package_install', 'android.permission.access_location', 'android.permission.broadcast_package_changed', 'android.permission.access_mtk_mmhw', 'android.permission.read_owner_data', 'android.permission.manage_documents', 'android.permission.access_superuser', 'android.permission.write_media_storage', 'android.permission.update_device_stats', 'android.permission.access_assisted_gps', 'android.permission.read_sync_stats', 'android.permission.raised_thread_priority', 'android.permission.persistent_activity', 
'android.permission.mout_unmount_filesystems']

syscalls = ['UNK', 'accept', 'access', 'bind', 'brk', 'cacheflush', 'capset', 'chdir', 'chmod', 'clock_gettime', 'clone', 'close', 'connect', 'dup', 'dup2', 'epoll_create', 'epoll_ctl', 'epoll_wait', 'execve', 'exit', 'exit_group', 'fchmod', 'fchown32', 'fcntl', 'fcntl64', 'fdatasync', 'fgetxattr', 'flock', 'fork', 'fsetxattr', 'fstat64', 'fsync', 'ftruncate', 'ftruncate64', 'futex', 'getcwd', 'getdents64', 'getegid32', 'geteuid32', 'getgid32', 'getgroups32', 'getpgid', 'getpid', 'getppid', 'getpriority', 'getresgid32', 'getresuid32', 'getrlimit', 'getsockname', 'getsockopt', 'gettid', 'gettimeofday', 'getuid32', 'inotify_add_watch', 'inotify_init', 'inotify_rm_watch', 'ioctl', 'kill', 'listen', 'lseek', 'lstat64', 'madvise', 'mkdir', 'mmap2', 'mprotect', 'mremap', 'msync', 'munmap', 'nanosleep', 'open', 'pciconfig_iobase', 'personality', 'pipe', 'poll', 'prctl', 'pread', 'ptrace', 'pwrite', 'read', 'readlink', 'recvfrom', 'recvmsg', 'rename', 'restart_syscall', 'rmdir', 'rt_sigreturn', 'rt_sigtimedwait', 'sched_getparam', 'sched_getscheduler', 'sched_yield', 'select', 'sendmsg', 'sendto', 'set_tls', 'setgid32', 'setgroups32', 'setitimer', 'setpgid', 'setpriority', 'setresuid32', 'setrlimit', 'setsid', 'setsockopt', 'setuid32', 'shutdown', 'sigaction', 'sigprocmask', 'sigreturn', 'socket', 'socketpair', 'stat64', 'statfs', 'statfs64', 'tgkill', 'timerfd', 'timerfd_settime', 'umask', 'uname', 'unlink', 'utimes', 'vfork', 'wait4', 'write', 'writev']
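
A minimal sketch of how these vocabularies might be used to decode a record, assuming that SYSCALL_i values are indices into the syscalls list and that INTENT_i / PERMISSION_i flags line up positionally with the intents and permissions lists. The helper names are hypothetical, and falling back to the first entry ('UNK') for out-of-range codes is a choice made for this sketch, not documented behaviour.

```python
# First entries of the lists above, copied to keep the example self-contained
SYSCALLS_HEAD = ['UNK', 'accept', 'access', 'bind', 'brk']
PERMISSIONS_HEAD = [
    'android.permission.internet',
    'android.permission.access_network_state',
    'android.permission.write_external_storage',
]

def decode_sequence(indices, vocabulary):
    """Map integer codes to names; out-of-range codes fall back to vocabulary[0]."""
    return [vocabulary[i] if 0 <= i < len(vocabulary) else vocabulary[0]
            for i in indices]

def active_features(flags, vocabulary):
    """Return the names whose binary flag is set (INTENT_* or PERMISSION_*)."""
    return [name for flag, name in zip(flags, vocabulary) if flag == 1]

decoded = decode_sequence([4, 2, 0, 99], SYSCALLS_HEAD)
granted = active_features([1, 0, 1], PERMISSIONS_HEAD)
```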


We would like to thank Universidade Nove de Julho and the Coordination for the Improvement of Higher Education Personnel (CAPES) for supporting this research.



The Dada dataset is associated with the paper “Debiasing Android Malware Datasets: How can I trust your results if your dataset is biased?”. The goal of this dataset is to provide a new updated dataset of goodware/malware applications that can be used by other researchers for performing experiments, for example, detection or classification algorithms. The dataset contains the applications hashes and some characteristics. The dataset DOES NOT contain the malware themselves but one can download them with their hash from well-known repositories such as AndroZoo and VirusShare.

We provide well-known old datasets (Drebin, AMD) and several extracts of the AndroZoo and VirusShare repositories of applications. We also provide the pre-computed output of the debiasing process, in which labeled biased datasets are modified to resemble an extract of AndroZoo. Researchers can directly use these outputs to perform their own experiments. We also provide the scripts that implement the proposed debiasing algorithm to make our experiments fully reproducible.

Quick-start for using the output datasets for your own experiment

If you just want to use the mixed datasets (goodware/malware) for your experiments, run:

python3 api_key_androzoo api_key_virusshare ./

where api_key_androzoo is your API key file provided by the team administering AndroZoo, and api_key_virusshare is the API key file provided by VirusShare. This script downloads applications from AndroZoo according to the result of debiasing Drebin/VirusShare mixed with NAZE. This result is cached for you.

Several datasets are provided in the datasets/ folder, in particular mixed datasets with goodware/malware:

  • DR-AG_deb: a debiased version of Drebin (DR) mixed with goodware from AndroZoo (AG)
  • VS-AG_deb: a debiased version of VirusShare (VS) mixed with goodware from AndroZoo (AG)
  • VS-AG_deb-04: a debiased version of VirusShare (VS) with delta = 0.04 mixed with goodware from AndroZoo (AG)

├── DR-AG_deb
│   ├── DR-AG_deb-test(.sha256,.characteristics,.merged_characteristics).csv
│   ├── DR-AG_deb-training(.sha256,.characteristics,.merged_characteristics).csv
│   ├── DR-AG_deb-training.goodmal.csv
│   └── DR-AG_deb-test.goodmal.csv
├── VS-AG_deb
│   ├── VS-AG_deb-test(.sha256,.characteristics,.merged_characteristics).csv
│   ├── VS-AG_deb-training(.sha256,.characteristics,.merged_characteristics).csv
│   ├── VS-AG_deb-training.goodmal.csv
│   └── VS-AG_deb-test.goodmal.csv
├── VS-AG_deb-04
│   ├── VS-AG_deb-04-test(.sha256,.characteristics,.merged_characteristics).csv
│   ├── VS-AG_deb-04-training(.sha256,.characteristics,.merged_characteristics).csv
│   ├── VS-AG_deb-04-training.goodmal.csv
│   └── VS-AG_deb-04-test.goodmal.csv

More information about how these datasets have been constructed is given in the paper and this README.


We provide each dataset as a list of hashes in a file, together with some additional information, such as whether an APK is malware or not for mixed datasets. As the primary intent of this work is to debias datasets, we do not need (nor provide) the APKs themselves. Nevertheless, one can recover the full content of these datasets with helper scripts, as explained at the end of this document.

All dataset information is located in the files of the datasets/ folder.

File structure

  • file.sha256.txt: hashes of the applications of the dataset
  • file.characteristics.csv: the characteristics for each SHA-256 hash (DroidLysis only)
  • file.merged_characteristics.csv: the characteristics for each SHA-256 hash (DroidLysis + FalDroid)
  • file.goodmal.csv: the information about the class (goodware=0 or malware=1) when the dataset is mixed. This file is optional when the dataset is a full goodware or malware file.
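
As an illustration of how these files fit together, here is a minimal sketch that attaches the class from a goodmal file to the rows of a characteristics file, joining on the SHA-256 hash. The column layout of file.goodmal.csv is not described here, so the "sha256" and "malware" column names used below for that file are assumptions, as are the helper names.

```python
import csv
import io

def load_labels(goodmal_file):
    """Read a goodmal CSV into {sha256: 0 or 1} (assumed columns: sha256, malware)."""
    return {row["sha256"]: int(row["malware"]) for row in csv.DictReader(goodmal_file)}

def attach_labels(characteristics_file, labels):
    """Return (characteristics_row, label) pairs for every app that has a label."""
    pairs = []
    for row in csv.DictReader(characteristics_file):
        if row["sha256"] in labels:
            pairs.append((row, labels[row["sha256"]]))
    return pairs

# In-memory stand-ins for file.goodmal.csv and file.characteristics.csv
goodmal = io.StringIO("sha256,malware\naaa,0\nbbb,1\n")
characteristics = io.StringIO("sha256,APK size,Year\naaa,2,1\nbbb,0,3\nccc,1,1\n")
labeled = attach_labels(characteristics, load_labels(goodmal))
```

With real files, the io.StringIO objects would be replaced by files opened with open(path, newline="").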

The header of the characteristics.csv file is:

sha256,APK size,Year,Internet Permission,External storage,Uses Play Services,Generates UUIDs,Vibrate phone,NFC,Bluetooth,Uses HTTP,Uses JSON,Specify User-Agent,apk_size,dex_date,year,minSdkVersion,targetSdkVersion,android.permission.READ_PHONE_STATE,android.permission.READ_CONTACTS,android.permission.READ_SMS,android.permission.CAMERA,android.permission.RECORD_AUDIO,android.permission.READ_EXTERNAL_STORAGE,android.permission.READ_HISTORY_BOOKMARKS,android.permission.ACCESS_NETWORK_STATE,android.permission.ACCESS_WIFI_STATE,android.permission.GET_TASKS,android.permission.ACTIVITY_RECOGNITION,android.permission.INTERNET,android.permission.SEND_SMS,android.permission.CALL_PHONE,android.permission.READ_CALL_LOG,android.permission.BLUETOOTH_ADMIN,android.permission.BLUETOOTH,android.permission.BODY_SENSORS,android.permission.GET_ACCOUNTS,android.permission.WRITE_EXTERNAL_STORAGE,android.permission.NFC,android.permission.WRITE_CONTACTS,android.permission.WRITE_SMS,android.permission.MOUNT_FORMAT_FILESYSTEMS,android.permission.CHANGE_NETWORK_STATE,android.permission.CHANGE_WIFI_STATE,android.permission.REORDER_TASKS,android.permission.WAKE_LOCK,android.permission.REBOOT,android.permission.KILL_BACKGROUND_PROCESSES,android.permission.INSTALL_PACKAGES,android.permission.REQUEST_INSTALL_PACKAGES,android.permission.INJECT_EVENTS,android.permission.SYSTEM_ALERT_WINDOW,abort_broadcast,accessibility_service,account_pwd,airplane,android_id,andy,answer_call,apkprotect,base64,battery,bluestacks,board,bookmarks,bootloader,brand,busybox,calendar,call,call_log,camera,check_permission,contacts,cookie_manager,cpu_abi,crc32,c2dm,debugger,device_admin,dex_class_loader,dex_file,dhcp_server,dns,doze_mode,email,emulator,encryption,end_call,execute_native,fingerprint,genymotion,get_accounts,get_active_network_info,get_external_storage_stage,get_imei,get_imsi,get_installed_packages,get_installer_package_name,get_line_number,get_mac,get_network_operator,get_package_info,get_sim_country_iso,get_sim_operator,get_sim_serial_number,get_sim_slot_index,gps,gzip,hardware,hide_softkeyboard,http,intent_chooser,ip_address,ip_properties,javascript,jni,json,keyguard,kill_proc,link_speed,load_dex,load_library,logcat,manufacturer,microphone,model,nop,nox,obfuscation,open_non_asset,package_delete,package_sig,pangxie,password,phone_number,play_protect,post,product,receive_sms,record,reflection,ringer,rooting,rssi,scp,search_url,send_sms,sensor,set_component,shortcut,socket,ssh,ssid,stacktrace,su,substrate,system_app,tasks,uri,url_history,user_agent,uuid,version,vibrate,vnd_package,wakelock,wallpaper,webview,wifi,zip,am_start_elsewhere,android_wear,coinhive,cryptocurrency,cryptoloot,c2_anon,gps_elsewhere,javascript_html_load,jni_onload,has_phonenumbers,has_url,ip_address_elsewhere,kill_elsewhere,miner,mms,play_protect_elsewhere,play_services,pm_install_elsewhere,qemu,screen_on_off,sfr,su_exector,systemprop,ch***,exec,shell,mounts,geteuid,adb,pm_install,pm_list,am_broadcast,am_start,kill,ptrace,proc_version,possible_exploit,ragecage,exploid,zerg,levitator,mempodroid,towelroot,supersu,dalvikvm,dexclassloader,loadclass,url_in_exec,mtk_su

Two versions of the characteristics files (filenames contain either .characteristics or .merged_characteristics) are given for the mixed datasets. This is because we added some extra characteristics from the FalDroid tool (merged file). These extra files only exist for mixed datasets because we only computed these characteristics for machine learning experiments. This is explained later in this readme file (section Including extra features from FalDroid).

Malware datasets:

The datasets of the paper correspond to the files:

  • Drebin: Drebin
  • AMD: AMD
  • VirusShare 2015: VirusShare_2015
  • VirusShare 2016: VirusShare_2016
  • VirusShare 2017: VirusShare_2017
  • VirusShare 2018: VirusShare_2018
  • ACT-M: ACT-M
  • AZL-M: AZL-M

AndroZoo extracts:

The datasets of the paper correspond to the files:

  • AZ19 100k: AZ19_100k
  • AZ19 100k 2015: AZ19_100k_2015
  • AZ19 100k 2016: AZ19_100k_2016
  • AZ19 100k 2017: AZ19_100k_2017
  • AZ19 100k 2018: AZ19_100k_2018
  • AZ20 10k: AZ20_10k
  • AZ20 20k: AZ20_20k
  • AZ20 30k: AZ20_30k

Note that a few applications have been removed from these extracts as analysis tools like apktool fail to analyze these apps.

Goodware datasets:

The datasets of the paper correspond to the files:

  • NAZE-18-G: NAZE-18-G
  • NAZE-18-G_Debiased: NAZE-18-G_deb
  • AZ19 100k-G: AZ19_100k-G
  • ACT-G: ACT-G
  • AZL-G: AZL-G


A micro-benchmark suite to assess the stability of taint-analysis tools for Android.

  • DroidBench: DroidBench

Debiased malware datasets:

The datasets of the paper correspond to the files:

Drebin debiased (Drebin_deb):

  • delta = 0.04: Drebin_deb-04
  • delta = 0.02: Drebin_deb-02
  • delta = 0.01: Drebin_deb-01

VirusShare debiased (VS15-18_deb):

  • delta = 0.04: VS15-18_deb-04
  • delta = 0.02: VS15-18_deb-02
  • delta = 0.01: VS15-18_deb-01
  • delta = 0.005: VS15-18_deb-005

VirusShare 2015 debiased (VS15_deb):

  • delta = 0.04: VS15_deb-04
  • delta = 0.02: VS15_deb-02

VirusShare 2016 debiased (VS16_deb):

  • delta = 0.04: VS16_deb-04

VirusShare 2017 debiased (VS17_deb):

  • delta = 0.04: VS17_deb-04

VirusShare 2018 debiased (VS18_deb):

  • delta = 0.04: VS18_deb-04
  • delta = 0.02: VS18_deb-02

Mixed datasets:

These datasets contain the additional file file.goodmal.csv that informs about the goodware/malware status of an APK.


The datasets of the paper correspond to the files:

  • D_mix: D_mix

Drebin_Debiased + NAZE_Debiased-18-G

These datasets have been built to be directly usable by machine learning algorithms. Instructions for downloading them are given at the end of this document. Downloading all the APKs of these datasets is not required to execute the debiasing algorithms.

The datasets of the paper correspond to the files:


  • training: DR-AG_deb-training
  • test: DR-AG_deb-test
  • goodware/malware information: DR-AG_deb-training.goodmal.csv, DR-AG_deb-test.goodmal.csv

DR-AG-C2_deb: (with C2 constraint)

  • training: DR-AG-C2_deb-training
  • test: DR-AG-C2_deb-test
  • goodware/malware information: DR-AG-C2_deb-training.goodmal.csv, DR-AG-C2_deb-test.goodmal.csv

VS_Debiased-15-18 + NAZE_Debiased-18-G


  • training: VS-AG_deb-training
  • test: VS-AG_deb-test
  • goodware/malware information: VS-AG_deb-training.goodmal.csv, VS-AG_deb-test.goodmal.csv

VS-AG-C2_deb: (with C2 constraint)

  • training: VS-AG-C2_deb-training
  • test: VS-AG-C2_deb-test
  • goodware/malware information: VS-AG-C2_deb-training.goodmal.csv, VS-AG-C2_deb-test.goodmal.csv

VS_Debiased-15-18-04 + NAZE_Debiased-18-G-01


  • training: VS-AG_deb-04-training
  • test: VS-AG_deb-04-test
  • goodware/malware information: VS-AG_deb-04-training.goodmal.csv, VS-AG_deb-04-test.goodmal.csv

Drebin + AZ19 100k

  • training: DR-AG-training
  • goodware/malware information: DR-AG-training.goodmal.csv
  • No test set.

VS 15-18 + AZ19 100k

  • training: VS-AG-training
  • test: VS-AG-test
  • goodware/malware information: VS-AG-training.goodmal.csv, VS-AG-test.goodmal.csv

AndroCT (ACT)


Training sets:

  • training: ACT14-training
  • test: ACT14-test
  • goodware/malware information: ACT14-training.goodmal.csv, ACT14-test.goodmal.csv


  • training: ACT17-training
  • test: ACT17-test
  • goodware/malware information: ACT17-training.goodmal.csv, ACT17-test.goodmal.csv

AZ20 30k with labels

This dataset includes an additional file that states the goodware/malware information, extracted from AndroZoo.


  • training: AZL14-training
  • test: AZL14-test
  • goodware/malware information: AZL14-training.goodmal.csv, AZL14-test.goodmal.csv


  • training: AZL17-training
  • test: AZL17-test
  • goodware/malware information: AZL17-training.goodmal.csv, AZL17-test.goodmal.csv

Usage of debiasing and evaluation scripts

To replay the dataset debiasing process (or try it with a new arrangement of datasets), scripts are provided for this purpose. For debiasing, we need to generate the possible classes. Classes are defined by all the observed combinations of characteristics.


First, install the packages using the requirements.txt.

pip3 install -r requirements.txt

Generation of classes

To run the debiasing algorithm, multiple subfolders are required. They hold new files that contain the classes (combinations of characteristics) and the APKs that belong to each class. These subfolders are generated into input_datasets.

To generate these subfolders, use the following command:

python3 config.datasets.original.ini datasets

For example, for Drebin, the script generates:


  • the .characteristics.csv file is a copy of the original file located in the datasets/ folder
  • the .classcount.csv file indicates the number of APKs that match a combination of characteristics (a class). For example, for Drebin, the class “0,2,1,1,0,0,0,0,0,0,1,0” contains 139 applications.
  • the .combination_hashes.json file contains a dictionary that associates each combination of characteristics (a class) with the list of SHA-256 hashes of the APKs in that class

This information can be used later more efficiently when debiasing datasets.
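
The relationship between these files can be sketched as follows. This is not the provided script, only a minimal illustration under the assumption that a class key is the comma-separated list of the selected characteristic values; the function name build_classes and the two columns chosen for the example are hypothetical.

```python
from collections import defaultdict

def build_classes(rows, characteristic_columns):
    """Group APK hashes by their combination of characteristic values (a class)."""
    combination_hashes = defaultdict(list)
    for row in rows:
        key = ",".join(str(row[column]) for column in characteristic_columns)
        combination_hashes[key].append(row["sha256"])
    # The class count is simply the number of hashes in each class
    classcount = {key: len(hashes) for key, hashes in combination_hashes.items()}
    return classcount, dict(combination_hashes)

# Toy rows, for illustration only
rows = [
    {"sha256": "h1", "Year": 2, "Uses HTTP": 1},
    {"sha256": "h2", "Year": 2, "Uses HTTP": 1},
    {"sha256": "h3", "Year": 0, "Uses HTTP": 1},
]
classcount, combination_hashes = build_classes(rows, ["Year", "Uses HTTP"])
```

Here classcount plays the role of the .classcount.csv content and combination_hashes that of the .combination_hashes.json content.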

Notice that the configuration file config.datasets.original.ini is copied as config.datasets.ini. This last file will be used as the working configuration file for the rest of the README.

Single dataset debiasing algorithm

To debias one base dataset (or a union of base datasets) with a target dataset and a list of source datasets, the user should refer to each dataset by the basename of its files in the datasets folder. The command line looks like the following:

python3 config.datasets.ini \
new_dataset_name \
"Name of the new dataset" \
--base-datasets base_dataset_list \
--target-dataset target_dataset_name \
--source-datasets list \
--delta value

where delta is a value in [0,1] that controls the distance between the output and the target dataset.

Four outputs are expected in out/new_dataset_name and are similar to the outputs of the generation of classes:

  • the file new_dataset_name.characteristics.csv: the characteristics.
  • the file new_dataset_name.features_specs.classcount.csv: the number of APKs for each combination of characteristics (class).
  • the file new_dataset_name.features_specs.combination_hashes.json: contains a dictionary that associates a class with all SHA256 APKs.
  • the file new_dataset_name.features_specs.dataset_class_info.json: some general information about the experiment:
    • size: size of the dataset
    • modified: true if this dataset has been generated
    • base dataset list and original size
    • target dataset
    • source dataset list and size
    • delta
    • number of combinations (classes)
    • debiasable: false if the debiasing algorithm fails
    • added: the number of APKs added from the source datasets to this new dataset
    • removed: the number of APKs removed from the base dataset
    • d_min final: the d_min value at the end of the algorithm described in the paper
    • percent modifs: ratio of additions and removals of APKs over the size of the base datasets, between 0 and 1
    • add ratio: ratio of added APKs over the size of the generated dataset, between 0 and 1
    • run time: the duration of performing the debiasing algorithm over this configuration

For example, to debias the Drebin dataset with AZ20_30k as the target dataset, AMD and VirusShare_201{5,6,7,8} as source datasets, and a delta of 0.01, the user should launch:

python3 config.datasets.ini \
Drebin_deb-01-replay \
"Drebin_deb-01 (Replay)" \
--base-datasets Drebin \
--target-dataset AZ20_30k \
--source-datasets AMD VirusShare_201{5,6,7,8} \
--delta "0.01"

We call this experiment “Replay” because the user replays the debiasing algorithm and should obtain results similar to the already provided dataset datasets/Drebin_deb-01.

Replaying this experiment generates in the folder out/Drebin_deb-01-replay/ the files:


In particular, in the class_info file, we note that:

  • 103 apps have been added
  • 4596 apps have been removed
  • the add ratio is 12.7%
  • the final dataset size is 811
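
These figures can be cross-checked against the definitions of add ratio and percent modifs given earlier. The short calculation below assumes that the final size equals the base size minus the removed apps plus the added apps; the base size itself is not stated here and is derived under that assumption.

```python
added = 103
removed = 4596
final_size = 811

# Assumption: final size = base size - removed + added
base_size = final_size - added + removed

# add ratio: added APKs over the size of the generated dataset
add_ratio = added / final_size

# percent modifs: additions and removals over the size of the base dataset
percent_modifs = (added + removed) / base_size
```

Under that assumption, 103 / 811 is about 0.127, which agrees with the reported add ratio of 12.7%.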

Even though the algorithm will generate a different dataset each time, the number of elements per class is the same in every re-run with the same base, target and source datasets, and delta. To verify this, the user can check the difference of the classcount.csv files between the original and the replay:

diff <(sort input_datasets/Drebin_deb-01/Drebin_deb-01.features_specs.classcount.csv) \
<(sort out/Drebin_deb-01-replay/Drebin_deb-01-replay.features_specs.classcount.csv)
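
For environments without process substitution, the same order-independent comparison can be sketched in Python. The helper name same_class_counts is hypothetical, and the line layout used in the toy data is only an assumed shape for a classcount.csv row.

```python
from collections import Counter

def same_class_counts(lines_a, lines_b):
    """True if both files hold the same lines, regardless of their order."""
    def normalize(lines):
        # Ignore surrounding whitespace and empty lines, keep duplicates
        return Counter(line.strip() for line in lines if line.strip())
    return normalize(lines_a) == normalize(lines_b)

# Toy content standing in for the two classcount.csv files
original = ['"0,2,1",139\n', '"1,0,0",7\n']
replay = ['"1,0,0",7\n', '"0,2,1",139\n']
identical = same_class_counts(original, replay)
different = same_class_counts(original, ['"1,0,0",8\n', '"0,2,1",139\n'])
```

With real files, each argument can be an open file object, since iterating over a file yields its lines.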

If the debiasing algorithm fails, several solutions can be tested:

  • increase the delta value to let the output be farther from the target
  • provide more samples in the source datasets (probably some classes do not have enough applications)

Comparing datasets with a population

When new datasets are generated, or with the input datasets, the user may want to evaluate the distance between these datasets and an extract of the population. In particular, we provide a script to evaluate the Chi2 and the p-value, using the following command:

python3 config.datasets.ini \
--population population_dataset \
--datasets list of datasets to evaluate \
--append-population-name \
--filename "filename"

The parameters are:

  • --population: indicates the name of the dataset to use as an extract of the considered population
  • --datasets: a list of dataset names that can be located both in the datasets/ or out/ folders
  • (optional) --append-population-name: add the name of the population in the output file
  • (optional) --filename: the name of the output file

The outputs are:

  • out/count_analysis_output/filename_population_dataset.xlsx: a table containing the comparison of the considered datasets (Chi2, p-value, added/removed app count, etc.)
  • out/count_analysis_output/filename_population_dataset.tex: a LaTeX table containing the max delta and the size of the datasets.

These outputs, in particular the latex output, can be customized to your needs.
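
To give an intuition for the Chi2 value, the sketch below computes a standard Pearson chi-square statistic of a dataset's class counts against the class proportions of a population extract. How the provided script computes its Chi2 and p-value is not detailed here, so this is an assumption about the general technique, not a description of the script; the function name is hypothetical.

```python
def chi_square_statistic(dataset_counts, population_counts):
    """Pearson chi-square of observed class counts against population proportions.

    Classes that never occur in the population are skipped in this sketch,
    since their expected count would be zero.
    """
    dataset_size = sum(dataset_counts.values())
    population_size = sum(population_counts.values())
    statistic = 0.0
    for cls, population_count in population_counts.items():
        if population_count == 0:
            continue
        expected = dataset_size * population_count / population_size
        observed = dataset_counts.get(cls, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic

# Toy class counts, for illustration only
population = {"a": 50, "b": 50}
biased = chi_square_statistic({"a": 30, "b": 10}, population)
matching = chi_square_statistic({"a": 20, "b": 20}, population)
```

A dataset whose class proportions match the population yields a statistic of 0, and the value grows as the dataset drifts away from the population. A p-value can then be obtained from the chi-square distribution, for instance with scipy.stats.chi2.sf if SciPy is available.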

For example, for comparing the following three datasets with the extract of AndroZoo of size 30k extracted in 2020 (AZ20_30k):

  • Drebin: the original Drebin dataset
  • Drebin_deb-01: the debiased dataset already computed and dropped in the datasets/ folder
  • Drebin_deb-01-replay: the debiased dataset just generated by following this README

The user should use the following command:

python3 config.datasets.ini \
--population AZ20_30k \
--datasets Drebin Drebin_deb-01{,-replay} \
--append-population-name --filename "Drebin_debias_replay"

As shown in the output of the script, Drebin_deb-01 and the replay (Drebin_deb-01-replay) have the same Chi2 value, which is expected. The file out/count_analysis_output/Drebin_debias_replay_AZ20_30k.xlsx contains a table with information about Drebin, the provided debiased dataset, and the newly generated one:

[Figure: Count analysis result]

Mix dataset debiasing algorithm

For producing mixed datasets, we provide a script that takes two datasets as input: one should contain the malware, the other the goodware.

The command is the following:

python3 config.datasets.ini \
"id_name_of_the_generated_dataset" "Full name of the generated dataset" \
id_debiased_malware_dataset \
id_debiased_goodware_dataset \
year-time-barrier_training-test \
--date-fix year_fix.sha256.csv

with the parameters:

  • id_debiased_malware_dataset: the id of the malware dataset, loaded from the datasets/ or out/ folder.
  • id_debiased_goodware_dataset: the id of the goodware dataset, loaded from the datasets/ or out/ folder.
  • year-time-barrier_training-test: an integer representing the year used to split the datasets into the training part and the test part.
  • (optional) --date-fix year_fix.sha256.csv: a helper file, provided by the user, that helps identify the date of broken APKs. Some APKs have a date of 0 when the date is extracted from the APK archive (zip date construction). In this case, the script can open the helper file to search for an alternative date.
  • (optional) --percent: specify the percent of malware applications for the output test dataset (the default is 5%)
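
The roles of the year barrier and of the date-fix helper can be sketched as follows. This is a simplified illustration, not the provided script: it assumes that apps dated up to and including the barrier year go to the training set and later ones to the test set, and that the helper file has already been loaded into a dictionary mapping SHA-256 hashes to years. Both function names are hypothetical.

```python
def resolve_year(sha256, year, date_fix):
    """Use the helper mapping when the extracted year is 0 (broken APK date)."""
    if year == 0:
        return date_fix.get(sha256, 0)
    return year

def split_by_year(apps, barrier, date_fix=None):
    """Split (sha256, year) pairs into training, test and undated hashes."""
    date_fix = date_fix or {}
    training, test, undated = [], [], []
    for sha256, year in apps:
        year = resolve_year(sha256, year, date_fix)
        if year == 0:
            undated.append(sha256)    # no usable date, even with the helper
        elif year <= barrier:
            training.append(sha256)   # assumption: the barrier year is in training
        else:
            test.append(sha256)
    return training, test, undated

# Toy data, for illustration only
apps = [("h1", 2011), ("h2", 2013), ("h3", 2015), ("h4", 0), ("h5", 0)]
training, test, undated = split_by_year(apps, 2013, {"h4": 2014})
```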

For example, to mix the “debiased Drebin” dataset created above (Drebin_deb-01-replay) with the “debiased NAZE” dataset, using 2013 as the barrier that separates the training set from the test set, the user should run:

python3 config.datasets.ini \
"DR-AG-C2_deb-01" "DR-AG-C2_deb-01" \
Drebin_deb-01-replay \
NAZE-18-G_deb-001 \
2013

To leave the C2 condition out, add the “no-balance-time-window” option:

python3 config.datasets.ini \
"DR-AG_deb-01" "DR-AG_deb-01" \
Drebin_deb-01-replay \
NAZE-18-G_deb-001 \
2013 \

For specifying 10% of malware, add the “percent” option followed by 10:

python3 config.datasets.ini \
"DR-AG-C2_deb-01" "DR-AG-C2_deb-01" \
Drebin_deb-01-replay \
NAZE-18-G_deb-001 \
2013 \
--percent 10
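The time barrier and the malware percentage can be illustrated with a short sketch. It is not the provided script: the function names are hypothetical, and leaving out samples dated exactly in the barrier year is an assumption made for the sketch.

```python
import random


def split_by_year(samples, barrier_year):
    """Split (sha256, year) pairs: earlier years go to training, later years to test."""
    training = [s for s in samples if s[1] < barrier_year]
    test = [s for s in samples if s[1] > barrier_year]
    return training, test


def malware_quota(n_goodware, percent):
    """Number of malware needed so that malware make up `percent` % of the mix."""
    return round(n_goodware * percent / (100 - percent))


def build_test_set(malware, goodware, percent=5, seed=0):
    """Subsample malware so that the mix contains roughly `percent` % malware."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    quota = min(len(malware), malware_quota(len(goodware), percent))
    return rng.sample(malware, quota) + list(goodware)
```

With 38 goodware and the default 5%, malware_quota returns 2, giving a test set of 40 applications.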

This script outputs two folders, one for the training set and one for the test set. The content of each folder is similar to the output of debiasing a single dataset. For example, the mixing of Drebin and Naze generates:

  • out/DR-AG-C2_deb-01-replay-training
    • DR-AG-C2_deb-01-replay-training.characteristics.csv
    • DR-AG-C2_deb-01-replay-training.features_specs.classcount.csv
    • DR-AG-C2_deb-01-replay-training.features_specs.combination_hashes.json
    • DR-AG-C2_deb-01-replay-training.features_specs.dataset_class_info.json
  • out/DR-AG-C2_deb-01-replay-test-5p
    • DR-AG-C2_deb-01-replay-test-5p.characteristics.csv
    • DR-AG-C2_deb-01-replay-test-5p.features_specs.classcount.csv
    • DR-AG-C2_deb-01-replay-test-5p.features_specs.combination_hashes.json
    • DR-AG-C2_deb-01-replay-test-5p.features_specs.dataset_class_info.json

Comparing the intersection of two datasets

To count the number of elements that the mixed datasets share with these replays and with the original datasets, the following command can be used:

python3 config.datasets.ini \
--source-datasets list of datasets used for mixing \
--datasets mixed datasets \
--date-fix year_fix.sha256.csv

The source datasets are the ones that have been used for producing the mixed datasets. The script helps check that applications are well balanced in the produced mixed datasets.

For example, to analyse the mixed dataset built from “Drebin debiased” and “NAZE debiased”, the user should run:

python3 config.datasets.ini \
--source-datasets Drebin_deb-01-replay NAZE-18-G_deb-001 \
--datasets DR-AG-C2_deb-01-training DR-AG-C2_deb-01-test-5p

The output shows that the test set does not contain any application from debias-drebin or debias-goodware before 2013. This is expected, since the test set should only contain years greater than 2013. The output also shows that the training set is balanced between goodware and malware for each year.

For example, the training set contains the following (malware/goodware balanced) for the last available year:

Total for 2012:
| | Drebin_deb-01 (Replay) | NAZE-18-G_deb-001 | Total |
| DR-AG-C2_deb-01-training | 94 | 94 | 188 |

And the test set contains the following (5% malware) for the first available year:

Total for 2014:
| | Drebin_deb-01 (Replay) | NAZE-18-G_deb-001 | Total |
| DR-AG-C2_deb-01-test-5p | 2 | 38 | 40 |
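The per-year totals shown above amount to counting, for each year, how many hashes of a mixed dataset also appear in each source dataset. A minimal sketch of that count follows. The function name and the input structures are hypothetical, and the provided script may compute its tables differently.

```python
from collections import Counter


def yearly_origin_counts(mixed_hashes, sources, years):
    """Count, per year, how many hashes of a mixed dataset come from each source.

    mixed_hashes: iterable of sha256 strings in the mixed dataset
    sources: dict mapping a source dataset id to its set of sha256 strings
    years: dict mapping each sha256 to its year
    """
    counts = Counter()
    for sha in mixed_hashes:
        for source_id, source_hashes in sources.items():
            if sha in source_hashes:
                counts[(years[sha], source_id)] += 1
    return counts
```

A balanced training year then shows equal counts for the malware and goodware sources, as in the 2012 row above.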

Note that, because the hashes in the debiased datasets are different most of the time, the results shown may differ from the ones obtained with a new “Drebin debiased” and “NAZE debiased”. However, when the same datasets are used as inputs (the ones generated in the previous section) with a different “id” and “name”, the resulting mixed dataset will have the same number of hashes.

Performing all debiasing experiments

Scripts are provided to reproduce the experiments found in the paper. If you have not already done so, first launch the following script, which creates all the folders needed to continue:

python3 config.datasets.original.ini datasets

For reproducing all experiments produced in Table II, the user can launch the following script:


For reproducing all experiments produced in Table III, after reproducing the ones of Table II, the user can do:


Including extra features from FalDroid

For more information about generating additional features using FalDroid, please go to this repository.

After the output arff files are generated, they must be transformed into the proper .merged_characteristics.csv files to be used for ML experiments. To do this, the script transforms the arff files into .graph_characteristics.csv files. Then, the script joins these with the respective .characteristics.csv file to generate a .merged_characteristics.csv file. Files of this last type can be used for ML experiments (see section Machine Learning experiments).
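The join step can be pictured with the sketch below. It is not the provided script. It assumes both CSV files have a header row and share a key column, named "sha256" here purely for illustration, and it keeps only the rows present in both files.

```python
import csv


def read_rows(path, key="sha256"):
    """Index the rows of a CSV file by the (assumed) key column."""
    with open(path, newline="") as handle:
        return {row[key]: row for row in csv.DictReader(handle)}


def merge_characteristics(characteristics_path, graph_path, merged_path, key="sha256"):
    """Join two CSV files on `key` and write the merged rows; return the row count."""
    base = read_rows(characteristics_path, key)
    graph = read_rows(graph_path, key)
    shared = [k for k in base if k in graph]
    if not shared:
        return 0
    base_columns = list(base[shared[0]])
    extra_columns = [c for c in graph[shared[0]] if c not in base_columns]
    with open(merged_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=base_columns + extra_columns)
        writer.writeheader()
        for k in shared:
            # Values from the base file win when a column exists in both files.
            writer.writerow({**graph[k], **base[k]})
    return len(shared)
```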

Downloading APK datasets

We cannot provide the samples directly in this zip archive, as our institution does not allow us to do so. Nevertheless, we provide scripts to recover them from the sha256.txt files.

Goodware datasets

Goodware datasets can be downloaded from AndroZoo, using the provided script:

usage: [-h] api_key_file hash_list_file output_dir

For example, for Drebin:

python3 api_key_androzoo api_key_virusshare datasets/drebin.sha256.txt tmp
Num hashes: 5304
sha256 to download: a7f5522c5775945950aab6531979c78fd407238131fabd94a0cb47343a402f91
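For readers who want to see what such a download involves, here is a minimal sketch built on the public AndroZoo download endpoint (an API key and a SHA-256 passed as query parameters). It is not the provided script: the function names, the ".apk" file naming, and the skip-if-present logic are assumptions.

```python
import os
import urllib.parse
import urllib.request

ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def download_url(api_key, sha256):
    """Build the download URL for one APK."""
    query = urllib.parse.urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD}?{query}"


def pending_hashes(hash_list_file, output_dir):
    """Return the hashes from the list that are not yet present in output_dir."""
    with open(hash_list_file) as handle:
        hashes = [line.strip() for line in handle if line.strip()]
    return [h for h in hashes
            if not os.path.exists(os.path.join(output_dir, h + ".apk"))]


def download_all(api_key, hash_list_file, output_dir):
    """Download every missing APK, so an interrupted run can simply be relaunched."""
    os.makedirs(output_dir, exist_ok=True)
    for sha256 in pending_hashes(hash_list_file, output_dir):
        target = os.path.join(output_dir, sha256 + ".apk")
        # Write to a temporary name first so a partial file is never mistaken for an APK.
        urllib.request.urlretrieve(download_url(api_key, sha256), target + ".part")
        os.replace(target + ".part", target)
```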

Malware datasets

Malware datasets can be partially found in AndroZoo. Drebin and AMD are available, but all VirusShare datasets should be downloaded from the VirusShare website.

Mixed datasets

The mixed datasets can be fully downloaded from AndroZoo:

Usage: python3 api_key_androzoo api_key_virusshare outdir

python3 api_key_androzoo ./

The script creates the following tree and populates them:

├── DR-AG_deb
│   ├── DR-AG_deb-test_no-bal-time-win
│   └── DR-AG_deb-training_no-bal-time-win
├── DR-AG-C2_deb
│   ├── DR-AG-C2_deb-test
│   └── DR-AG-C2_deb-training
├── VS-AG_deb
│   ├── VS-AG_deb-test_no-bal-time-win
│   └── VS-AG_deb-training_no-bal-time-win
├── VS-AG-C2_deb
│   ├── VS-AG-C2_deb-test_no-bal-time-win
│   └── VS-AG-C2_deb-training_no-bal-time-win
├── VS-AG_deb-04
│   ├── VS-AG_deb-04-test_no-bal-time-win
│   └── VS-AG_deb-04-training_no-bal-time-win


The script can be interrupted, and the download can be relaunched afterwards.

Machine Learning experiments

For redoing ML experiments, please see the dedicated README.


Simple text files obtained by manually scraping the web for the question "What is Machine Learning?".

Each file contains the first paragraph or page of a website's answer to the question. This data is not used for commercial purposes and is available to all.

This data is used in TAES to demonstrate plagiarism checking. The text files (*.txt) contain plain text and need no preprocessing. Simply read the file and assign the data to a string object.
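Reading one of the files into a string takes a single call in Python. The UTF-8 encoding is an assumption and the function name is only illustrative.

```python
from pathlib import Path


def read_document(path):
    """Return the full content of a plain-text file as one string."""
    return Path(path).read_text(encoding="utf-8")
```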


This dataset was collected from 20 subjects over a duration of 1 hour, in experiments on upper-body bio-impedance with the following objectives:

a) Evaluate the effect of externally induced perturbations at the SE interface, caused by motion, applied pressure, temperature variation and posture change, on bio-impedance measurements.

b) Evaluate the degree of distortion due to artefacts at multiple frequencies (10kHz-100kHz) in the bio-impedance measurements.


The PD-BioStampRC21 dataset provides data from a wearable sensor accelerometry study conducted for studying activity, gait, tremor, and other motor symptoms in individuals with Parkinson's disease (PD). In addition to individuals with PD, the dataset also includes data for controls that went through the same study protocol as the PD participants. Data were acquired using lightweight MC 10 BioStamp RC sensors (MC 10 Inc, Lexington, MA), five of which were attached to each participant for gathering data over a roughly two day interval.


Users of the dataset should cite the following paper:

Adams JL, Dinesh K, Snyder CW, Xiong M, Tarolli CG, Sharma S, Dorsey E, Sharma G. "A real-world study of wearable sensors in Parkinson’s disease". NPJ Parkinson's disease. 2021 Nov 29;7(1):1-8.

An overview of the study protocol is also provided in the abovementioned paper. Additional detail specific to the dataset and file naming conventions is provided here.

The dataset is comprised of two main components: (I) Sensor and UPDRS-assessment-task annotation data for each participant and (II) demographic and clinical assessment data for all participants. Each of these is described in turn below:

I) Sensor and UPDRS-assessment-task annotation data:

The sensor accelerometry and UPDRS-assessment-task annotation data for all the participants are provided as a zip file. The size of the zip file is 11GB and, when unzipped, it generates a set of folders and files with a total size of approximately 56GB. Unzipping the file generates folders with names matching the participant ID for each of the Control and PD participants (17 Control + 17 PD). Each participant folder contains the data organized as the following files.

a) Accelerometer sensor data files (CSV) corresponding to the five different sensor placement locations, which are abbreviated as:

1) Trunk (chest)           - abbreviated as "ch"  

2) Left anterior thigh     - abbreviated as "ll"  

3) Right anterior thigh    - abbreviated as "rl"  

4) Left anterior forearm   - abbreviated as "lh"  

5) Right anterior forearm  - abbreviated as "rh"   

Example file name for accelerometer sensor data files: "AbbreviatedSensorLocation"_ID"ParticipantID"Accel.csv

E.g. ch_ID018Accel.csv, ll_ID018Accel.csv, rl_ID018Accel.csv, lh_ID018Accel.csv, and rh_ID018Accel.csv

File format for the accelerometer sensor data files: Comprises four columns that provide a timestamp for each measurement and the corresponding triaxial accelerometry relative to the sensor coordinate system.

Column 1: "Timestamp (ms)" - Time in milliseconds

Column 2: "Accel X (g)"    - Acceleration in X-direction (in units of g = 9.8 m/s^2)

Column 3: "Accel Y (g)"    - Acceleration in Y-direction (in units of g = 9.8 m/s^2)

Column 4: "Accel Z (g)"    - Acceleration in Z-direction (in units of g = 9.8 m/s^2)

Times and timestamps are consistently reported in units of milliseconds starting from the instant of the earliest sensor recording (for the first sensor applied to the participant).
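As an illustration of the file format above, the following sketch loads one sensor file and computes the acceleration magnitude of a sample. It assumes that the CSV files carry a header row with exactly the column names listed above. The function names are hypothetical.

```python
import csv
import math


def load_accel(path):
    """Return a list of (timestamp_ms, x_g, y_g, z_g) tuples from one sensor file."""
    rows = []
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            rows.append((
                float(record["Timestamp (ms)"]),
                float(record["Accel X (g)"]),
                float(record["Accel Y (g)"]),
                float(record["Accel Z (g)"]),
            ))
    return rows


def magnitude_g(sample):
    """Euclidean norm of the triaxial acceleration, in units of g."""
    _, x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)
```

For example, load_accel("ch_ID018Accel.csv") would return the trunk sensor samples of participant 018. A full two day recording is large, so a streaming variant may be preferable in practice.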

b) Annotation file (CSV). This file provides tagging annotations for the sensor data that identify, via start and end timestamps, the durations of various clinical assessments performed in the study.

Example file name for annotation file: AnnotID"ParticipantID".csv

E.g. AnnotID018.csv

File format for the annotation file: Comprises four columns

Column 1: "Event Type"           - List of in-clinic MDS-UPDRS assessments. Each assessment comprises two queries: medication status and MDS-UPDRS assessment body locations

Column 2: "Start Timestamp (ms)" - Start timestamp for the MDS-UPDRS assessments

Column 3: "Stop Timestamp (ms)"  - Stop timestamp for the MDS-UPDRS assessments

Column 4: "Value"                - Responses to the queries in Column 1: medication status (OFF/ON) and MDS-UPDRS assessment body locations (E.g. RIGHT HAND, NECK, etc.)
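The annotation timestamps can be used to cut the matching segment out of a sensor recording. The sketch below assumes a header row with the column names listed above, and sensor samples stored as tuples whose first element is the timestamp in milliseconds, sorted by time. The function names are hypothetical.

```python
import csv
from bisect import bisect_left, bisect_right


def load_annotations(path):
    """Return a list of (event_type, start_ms, stop_ms, value) tuples."""
    events = []
    with open(path, newline="") as handle:
        for record in csv.DictReader(handle):
            events.append((
                record["Event Type"],
                float(record["Start Timestamp (ms)"]),
                float(record["Stop Timestamp (ms)"]),
                record["Value"],
            ))
    return events


def samples_in_window(samples, start_ms, stop_ms):
    """Return the samples whose timestamp lies in [start_ms, stop_ms].

    `samples` must be sorted by timestamp (first element of each tuple).
    """
    timestamps = [s[0] for s in samples]
    lo = bisect_left(timestamps, start_ms)
    hi = bisect_right(timestamps, stop_ms)
    return samples[lo:hi]
```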

II) Demographic and clinical assessment data

For all participants, the demographic and clinical assessment data are provided as a zip file. Unzipping the file generates a CSV file named Clinic_Data_PD-BioStampRC21.csv

File format for the demographic and clinical assessment data file: Comprises 19 columns

Column 1: "ID"                                               - Participant ID

Column 2: "Sex"                                              - Participant sex (Male/Female)

Column 3: "Status"                                           - Participant disease status (PD/Control)

Column 4: "Age"                                              - Participant age

Column 5: "updrs_3_17a"                                      - Rest tremor amplitude (RUE - Right Upper Extremity)

Column 6: "updrs_3_17b"                                      - Rest tremor amplitude (LUE - Left Upper Extremity)

Column 7: "updrs_3_17c"                                      - Rest tremor amplitude (RLE - Right Lower Extremity)

Column 8: "updrs_3_17d"                                      - Rest tremor amplitude (LLE - Left Lower Extremity)

Column 9: "updrs_3_17e"                                      - Rest tremor amplitude (Lip/Jaw)

Column 10 - Column 14: "updrs_3_17a_off" - "updrs_3_17e_off" - Rest tremor amplitude during OFF medication assessment (same ordering as Column 5 to Column 9)

Column 15 - Column 19: "updrs_3_17a_on" - "updrs_3_17e_on"   - Rest tremor amplitude during ON medication assessment

Note that columns 10-19 do not contain any data for control participants and for PD participants that did not participate in the ON/OFF medication component of the assessment protocol for the study.
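Because columns 10-19 are empty for some participants, it can help to select the participants with ON/OFF scores before analysing them. The sketch below assumes a header row with the column names listed above, with names for columns 10-19 following the "updrs_3_17a_off" to "updrs_3_17e_on" pattern, and assumes that missing values appear as empty cells. The function names are hypothetical.

```python
import csv

# Column names for columns 10-19, built from the pattern described above.
ON_OFF_COLUMNS = [
    f"updrs_3_17{limb}_{state}" for state in ("off", "on") for limb in "abcde"
]


def has_on_off_scores(record):
    """True if at least one ON/OFF medication column holds a value."""
    return any((record.get(name) or "").strip() for name in ON_OFF_COLUMNS)


def participants_with_on_off(path):
    """Return the IDs of participants that have ON/OFF medication data."""
    with open(path, newline="") as handle:
        return [row["ID"] for row in csv.DictReader(handle) if has_on_off_scores(row)]
```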

For details about different MDS-UPDRS assessments and scoring schemes, the reader is referred to:        

Goetz, C. G. et al. Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results. Mov Disord 23, 2129-2170, doi:10.1002/mds.22340 (2008)


Amidst the COVID-19 pandemic, cyberbullying has become an even more serious threat. Our work aims to investigate the viability of an automatic multiclass cyberbullying detection model that is able to classify whether a cyberbully is targeting a victim’s age, ethnicity, gender, religion, or other quality. Previous literature has not yet explored making fine-grained cyberbullying classifications of such magnitude, and existing cyberbullying datasets suffer from quite severe class imbalances.


Please cite the following paper when using this open access dataset:

J. Wang, K. Fu, C.T. Lu, “SOSNet: A Graph Convolutional Network Approach to Fine-Grained Cyberbullying Detection,” Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), pp. 1699-1708, December 10-13, 2020.

This is a "Dynamic Query Expansion"-balanced dataset containing .txt files with 8000 tweets for each fine-grained class of cyberbullying: age, ethnicity, gender, religion, other, and not cyberbullying.

Total Size: 6.33 MB


Includes some data from:

S. Agrawal and A. Awekar, “Deep learning for detecting cyberbullying across multiple social media platforms,” in European Conference on Information Retrieval. Springer, 2018, pp. 141–153.

U. Bretschneider, T. Wohner, and R. Peters, “Detecting online harassment in social networks,” in ICIS, 2014.

D. Chatzakou, I. Leontiadis, J. Blackburn, E. D. Cristofaro, G. Stringhini, A. Vakali, and N. Kourtellis, “Detecting cyberbullying and cyberaggression in social media,” ACM Transactions on the Web (TWEB), vol. 13, no. 3, pp. 1–51, 2019.

T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” arXiv preprint arXiv:1703.04009, 2017.

Z. Waseem and D. Hovy, “Hateful symbols or hateful people? predictive features for hate speech detection on twitter,” in Proceedings of the NAACL student research workshop, 2016, pp. 88–93.

Z. Waseem, “Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter,” in Proceedings of the first workshop on NLP and computational social science, 2016, pp. 138–142.

J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, “Learning from bullying traces in social media,” in Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies, 2012, pp. 656–666. 


This dataset provides problem sets and the results of some classical algorithms from the evolutionary computation community.

The following tools were used: Pymoo, Platypus and Pagmo.