AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information
- Submitted by: Saurabh Kumar
- Last updated: Thu, 04/14/2022 - 02:51
- DOI: 10.21227/9ptx-5d17
Abstract
With the large-scale adoption of Android OS and ever-increasing contributions in the Android application space, Android has become the number one target of malware authors. In recent years, a large number of automatic malware detection and classification systems have evolved to tackle the dynamic nature of malware growth using either static or dynamic analysis techniques. The performance of static malware detection methods degrades under obfuscation attacks. Although many benchmark datasets are available to measure the performance of malware detection and classification systems, only a single obfuscated malware dataset (PRAGuard) is available to showcase the efficacy of existing malware detection systems against obfuscation attacks. PRAGuard contains outdated samples, dating up to 2013, and does not represent the latest application categories. Moreover, PRAGuard does not provide family information for its malware, so it cannot be used to evaluate malware family classification systems. Hence, we create and release AndroOBFS, a time-tagged obfuscated malware dataset with family information spanning three years, from 2018 to 2020.
The AndroOBFS dataset contains 16279 unique real-world obfuscated malware samples in six categories: (i) Trivial, (ii) Renaming, (iii) Encryption, (iv) Reflection, (v) Code, and (vi) Mix (a mix of two or more methods from (i) to (v)). Of these 16279 samples, 14579 are distributed across 158 families, with at least two unique malware samples in each family. We store all the information about the obfuscated malware in two CSV files: one covers all 16279 samples (16279.csv) and the other covers the 14579 familial malware samples (14579.csv). We release this dataset to aid Android malware research in designing robust, obfuscation-resilient malware detection and classification systems.
Part of the AndroOBFS dataset (4993 samples from 2019) was first created for our paper DeepDetect: A Practical On-device Android Malware Detector, accepted and published at the 21st IEEE International Conference on Software Quality, Reliability, and Security (QRS) 2021. The entire AndroOBFS dataset is associated with another paper, AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information, accepted at the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR) 2022. If you use any part of this dataset or the entire dataset, please cite our works in your research papers. You can cite our papers using BibTeX as follows:
@INPROCEEDINGS{9724811,
author={Kumar, Saurabh and Mishra, Debadatta and Panda, Biswabandan and Shukla, Sandeep Kumar},
booktitle={2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS)},
title={DeepDetect: A Practical On-device Android Malware Detector},
year={2021},
volume={},
number={},
pages={40-51},
doi={10.1109/QRS54544.2021.00015}
}
@INPROCEEDINGS{AndroOBFS:Kumar,
author={Kumar, Saurabh and Mishra, Debadatta and Panda, Biswabandan and Shukla, Sandeep Kumar},
booktitle={2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR)},
title={AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information},
year={2022},
volume={},
number={},
pages={}
}
Or
[1] S. Kumar, D. Mishra, B. Panda and S. K. Shukla, "DeepDetect: A Practical On-device Android Malware Detector," 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 2021, pp. 40-51, https://doi.org/10.1109/QRS54544.2021.00015.
[2] S. Kumar, D. Mishra, B. Panda and S. K. Shukla, "AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information," 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022.
Dataset Description
The AndroOBFS dataset contains 16279 unique real-world obfuscated malware samples in six categories: (i) trivial, (ii) renaming, (iii) encryption, (iv) reflection, (v) code, and (vi) mix (a mix of two or more obfuscation methods from (i) to (v)). Of these 16279 samples, 14579 are distributed across 158 families, with at least two unique malware samples in each family.
The AndroOBFS dataset repository contains 38 files: two CSV files and 36 compressed files in 7-Zip (.7z) format. The CSV files contain each malware sample's name, obfuscation category, and malware family name (14579.csv only), along with the year, quarter, and month in which the original, non-obfuscated malware was born. The files and their descriptions are listed below.
- 16279.csv: CSV file covering all 16279 samples, without family information.
- 14579.csv: CSV file covering the 14579 familial malware samples.
- 2018_Q1_1.7z: Obfuscated samples for the month of January in quarter 1 of 2018.
- 2018_Q1_2.7z: Obfuscated samples for the month of February in quarter 1 of 2018.
- 2018_Q1_3.7z: Obfuscated samples for the month of March in quarter 1 of 2018.
- 2018_Q2_4.7z: Obfuscated samples for the month of April in quarter 2 of 2018.
- 2018_Q2_5.7z: Obfuscated samples for the month of May in quarter 2 of 2018.
- 2018_Q2_6.7z: Obfuscated samples for the month of June in quarter 2 of 2018.
- 2018_Q3_7.7z: Obfuscated samples for the month of July in quarter 3 of 2018.
- 2018_Q3_8.7z: Obfuscated samples for the month of August in quarter 3 of 2018.
- 2018_Q3_9.7z: Obfuscated samples for the month of September in quarter 3 of 2018.
- 2018_Q4_10.7z: Obfuscated samples for the month of October in quarter 4 of 2018.
- 2018_Q4_11.7z: Obfuscated samples for the month of November in quarter 4 of 2018.
- 2018_Q4_12.7z: Obfuscated samples for the month of December in quarter 4 of 2018.
- 2019_Q1_1.7z: Obfuscated samples for the month of January in quarter 1 of 2019.
- 2019_Q1_2.7z: Obfuscated samples for the month of February in quarter 1 of 2019.
- 2019_Q1_3.7z: Obfuscated samples for the month of March in quarter 1 of 2019.
- 2019_Q2_4.7z: Obfuscated samples for the month of April in quarter 2 of 2019.
- 2019_Q2_5.7z: Obfuscated samples for the month of May in quarter 2 of 2019.
- 2019_Q2_6.7z: Obfuscated samples for the month of June in quarter 2 of 2019.
- 2019_Q3_7.7z: Obfuscated samples for the month of July in quarter 3 of 2019.
- 2019_Q3_8.7z: Obfuscated samples for the month of August in quarter 3 of 2019.
- 2019_Q3_9.7z: Obfuscated samples for the month of September in quarter 3 of 2019.
- 2019_Q4_10.7z: Obfuscated samples for the month of October in quarter 4 of 2019.
- 2019_Q4_11.7z: Obfuscated samples for the month of November in quarter 4 of 2019.
- 2019_Q4_12.7z: Obfuscated samples for the month of December in quarter 4 of 2019.
- 2020_Q1_1.7z: Obfuscated samples for the month of January in quarter 1 of 2020.
- 2020_Q1_2.7z: Obfuscated samples for the month of February in quarter 1 of 2020.
- 2020_Q1_3.7z: Obfuscated samples for the month of March in quarter 1 of 2020.
- 2020_Q2_4.7z: Obfuscated samples for the month of April in quarter 2 of 2020.
- 2020_Q2_5.7z: Obfuscated samples for the month of May in quarter 2 of 2020.
- 2020_Q2_6.7z: Obfuscated samples for the month of June in quarter 2 of 2020.
- 2020_Q3_7.7z: Obfuscated samples for the month of July in quarter 3 of 2020.
- 2020_Q3_8.7z: Obfuscated samples for the month of August in quarter 3 of 2020.
- 2020_Q3_9.7z: Obfuscated samples for the month of September in quarter 3 of 2020.
- 2020_Q4_10.7z: Obfuscated samples for the month of October in quarter 4 of 2020.
- 2020_Q4_11.7z: Obfuscated samples for the month of November in quarter 4 of 2020.
- 2020_Q4_12.7z: Obfuscated samples for the month of December in quarter 4 of 2020.
Each compressed (.7z) file follows a naming convention and contains six directories, one per obfuscation category, which hold the actual obfuscated malware samples, i.e., APK files. The naming convention and directory structure are as follows:
.
├─year_quarter_month.7z
|  ├─trivial
|  |  └── APK files for trivial category obfuscation methods
|  ├─renaming
|  |  └── APK files for renaming category obfuscation methods
|  ├─encryption
|  |  └── APK files for encryption category obfuscation methods
|  ├─reflection
|  |  └── APK files for reflection category obfuscation methods
|  ├─code
|  |  └── APK files for code category obfuscation methods
|  └─mix
|     └── APK files for mix category obfuscation methods
...
For example, the compressed file 2018_Q1_1.7z contains the obfuscated malware samples whose originals were born in January, in quarter 1 of 2018.
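The naming convention above can be checked mechanically. The following is a minimal sketch, assuming archive names always match the year_Qquarter_month.7z pattern seen in the file list; the helper names (`parse_archive_name`, `archive_name`, `expected_archives`) are hypothetical and not part of the dataset.

```python
import re
from typing import NamedTuple

# Pattern for the documented convention year_quarter_month.7z, e.g. 2018_Q1_1.7z.
ARCHIVE_RE = re.compile(r"^(\d{4})_Q([1-4])_(\d{1,2})\.7z$")


class ArchiveTag(NamedTuple):
    year: int
    quarter: int
    month: int


def parse_archive_name(name: str) -> ArchiveTag:
    """Split an archive name into (year, quarter, month) and sanity-check it."""
    match = ARCHIVE_RE.match(name)
    if match is None:
        raise ValueError(f"not a year_quarter_month.7z name: {name!r}")
    year, quarter, month = (int(group) for group in match.groups())
    # Months 1-3 fall in Q1, 4-6 in Q2, 7-9 in Q3, 10-12 in Q4.
    if not 1 <= month <= 12 or (month - 1) // 3 + 1 != quarter:
        raise ValueError(f"month {month} does not belong to quarter {quarter}")
    return ArchiveTag(year, quarter, month)


def archive_name(year: int, month: int) -> str:
    """Build the archive name for a given year and month (1-12)."""
    quarter = (month - 1) // 3 + 1
    return f"{year}_Q{quarter}_{month}.7z"


def expected_archives() -> list:
    """List the 36 monthly archive names for 2018 through 2020."""
    return [archive_name(y, m) for y in (2018, 2019, 2020) for m in range(1, 13)]
```

Comparing `expected_archives()` against the files actually downloaded is one way to spot a missing month before starting an experiment.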
Fields in CSV files
The data fields available in the two CSV files (16279.csv and 14579.csv) are described below:
16279.csv:
| file_name | sha256_hash | path | year | quarter | month | method |
14579.csv:
| file_name | sha256_hash | path | year | quarter | month | method | family_name |
- file_name: Name of the obfuscated APK file.
- sha256_hash: SHA256 hash of the non-obfuscated sample (original malware), which can be downloaded from the AndroZoo Project and VirusShare.com.
- path: Location of the obfuscated malware sample within the compressed file. For example, 2018_Q1_1.7z/code means that the corresponding sample is in the code folder of the compressed file 2018_Q1_1.7z.
- year: Year in which the original malware sample was born.
- quarter: Quarter of the year in which the original malware sample was born.
- month: Month of the specified year in which the original malware sample was born.
- method: Obfuscation category used to obfuscate the corresponding malware.
- family_name: Name of the family to which the malware belongs (14579.csv only).
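The fields above can be read with Python's standard csv module. This is a minimal sketch, assuming the files are comma-separated with a header row that uses exactly the field names listed; the two sample rows, their hash placeholders, and the helper names are hypothetical and only illustrate the layout.

```python
import csv
import io
from collections import Counter

# Hypothetical rows for illustration only; real values come from 14579.csv.
SAMPLE_CSV = """file_name,sha256_hash,path,year,quarter,month,method,family_name
sample_a.apk,hash_a,2018_Q1_1.7z/code,2018,1,1,code,family_x
sample_b.apk,hash_b,2019_Q3_9.7z/mix,2019,3,9,mix,family_y
"""


def load_rows(handle):
    """Read every CSV row into a dict keyed by the header names."""
    return list(csv.DictReader(handle))


def split_path(path):
    """Split '2018_Q1_1.7z/code' into ('2018_Q1_1.7z', 'code')."""
    archive, _, category = path.partition("/")
    return archive, category


def member_name(row):
    """Assumed location of the APK inside its archive: category/file_name."""
    _, category = split_path(row["path"])
    return f"{category}/{row['file_name']}"


def count_by_year_and_method(rows):
    """Count samples per (year, obfuscation category) pair."""
    return Counter((row["year"], row["method"]) for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = count_by_year_and_method(rows)
```

To work with the real files, replace the in-memory sample with `open("14579.csv", newline="")`. The same functions apply to 16279.csv, which lacks only the family_name column.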
For more details about the dataset and its creation process, please refer to our paper AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information, accepted at the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR) 2022.
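Because sha256_hash refers to the original, non-obfuscated sample, it can be used to confirm that a file obtained separately (for example, from AndroZoo) is the original that a row describes. It will not match the obfuscated APK itself. A minimal sketch, with hypothetical helper names, assuming the CSV stores the hash as a hexadecimal string:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large APKs are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_csv_hash(path, csv_hash):
    """Return True if the file's digest equals the sha256_hash value from the CSV."""
    return sha256_of_file(path) == csv_hash.strip().lower()
```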
Dataset Files
- 16279.csv (2.67 MB)
- 14579.csv (2.49 MB)
- 2018_Q1_1.7z (2.82 GB)
- 2018_Q1_2.7z (2.46 GB)
- 2018_Q1_3.7z (3.71 GB)
- 2018_Q2_4.7z (3.29 GB)
- 2018_Q2_5.7z (5.02 GB)
- 2018_Q2_6.7z (4.47 GB)
- 2018_Q3_7.7z (4.53 GB)
- 2018_Q3_8.7z (4.65 GB)
- 2018_Q3_9.7z (5.83 GB)
- 2018_Q4_10.7z (7.06 GB)
- 2018_Q4_11.7z (12.77 GB)
- 2018_Q4_12.7z (4.99 GB)
- 2019_Q1_1.7z (6.03 GB)
- 2019_Q1_2.7z (6.28 GB)
- 2019_Q1_3.7z (10.31 GB)
- 2019_Q2_4.7z (6.42 GB)
- 2019_Q2_5.7z (4.61 GB)
- 2019_Q2_6.7z (5.70 GB)
- 2019_Q3_7.7z (4.67 GB)
- 2019_Q3_8.7z (5.41 GB)
- 2019_Q3_9.7z (17.12 GB)
- 2019_Q4_10.7z (21.07 GB)
- 2019_Q4_11.7z (15.02 GB)
- 2019_Q4_12.7z (2.73 GB)
- 2020_Q1_1.7z (1.18 GB)
- 2020_Q1_2.7z (413.67 MB)
- 2020_Q1_3.7z (918.62 MB)
- 2020_Q2_4.7z (2.37 GB)
- 2020_Q2_5.7z (3.65 GB)
- 2020_Q2_6.7z (4.46 GB)
- 2020_Q3_7.7z (1.16 GB)
- 2020_Q3_8.7z (1.01 GB)
- 2020_Q3_9.7z (1.15 GB)
- 2020_Q4_10.7z (913.46 MB)
- 2020_Q4_11.7z (617.60 MB)
- 2020_Q4_12.7z (953.47 MB)