AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information
- Submitted by: Saurabh Kumar
- Last updated: Thu, 04/14/2022 - 02:51
- DOI: 10.21227/9ptx-5d17
Abstract
With the large-scale adoption of Android OS and ever-increasing contributions in the Android application space, Android has become the number one target of malware authors. In recent years, many automatic malware detection and classification systems have emerged to tackle the dynamic nature of malware growth, using either static or dynamic analysis techniques. The performance of static malware detection methods degrades under obfuscation attacks. Although many benchmark datasets are available to measure the performance of malware detection and classification systems, only a single obfuscated malware dataset (PRAGuard) is available to evaluate existing malware detection systems against obfuscation attacks. PRAGuard contains outdated samples, collected up to 2013, and does not represent the latest application categories. Moreover, PRAGuard does not provide family information for its malware, so it cannot be used to evaluate malware family classification systems. Hence, we create and release AndroOBFS, a time-tagged obfuscated malware dataset with family information spanning three years, from 2018 to 2020.
The AndroOBFS dataset contains 16279 unique real-world obfuscated malware samples in six categories: (i) Trivial, (ii) Renaming, (iii) Encryption, (iv) Reflection, (v) Code, and (vi) Mix (a combination of two or more methods from (i) to (v)). Of these 16279 samples, 14579 are distributed across 158 families, each containing at least two unique malware samples. We store all the information about the obfuscated malware and its family labels in two CSV files: one covers all 16279 samples (16279.csv) and the other covers the 14579 familial malware samples (14579.csv). We release this dataset to support Android malware research on robust, obfuscation-resilient malware detection and classification systems.
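As a rough illustration of how the CSV files might be summarized, the sketch below counts rows per value of a chosen column and checks the stated property that every family has at least two samples. The column names (`sha256`, `family`, `obfuscation`) and the sample rows are assumptions made up for this example; the real header of 14579.csv is not described here and should be checked before use.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how many rows share each value of `column` in a CSV file object."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Tiny in-memory stand-in for 14579.csv; header names and rows are hypothetical.
SAMPLE = """sha256,family,obfuscation
aaa,famA,Renaming
bbb,famA,Mix
ccc,famB,Encryption
ddd,famB,Encryption
"""

family_counts = count_by_column(io.StringIO(SAMPLE), "family")

# The description says each family holds at least two samples; list any that do not.
singletons = [name for name, n in family_counts.items() if n < 2]
print(family_counts, singletons)
```

For the real file, the same function could be called on `open("14579.csv", newline="")` with whatever column name the file actually uses for the family label.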
Part of the AndroOBFS dataset (4993 samples from 2019) was first created for our paper DeepDetect: A Practical On-device Android Malware Detector, published at the 21st IEEE International Conference on Software Quality, Reliability, and Security (QRS) 2021. The entire AndroOBFS dataset is associated with another paper, AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information, accepted at the IEEE/ACM 19th International Conference on Mining Software Repositories (MSR) 2022. If you use any part of this dataset or the entire dataset, please cite our works in your research papers. You can cite our papers using BibTeX as follows:
@INPROCEEDINGS{9724811,
author={Kumar, Saurabh and Mishra, Debadatta and Panda, Biswabandan and Shukla, Sandeep Kumar},
booktitle={2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS)},
title={DeepDetect: A Practical On-device Android Malware Detector},
year={2021},
volume={},
number={},
pages={40-51},
doi={10.1109/QRS54544.2021.00015}
}
@INPROCEEDINGS{AndroOBFS:Kumar,
author={Kumar, Saurabh and Mishra, Debadatta and Panda, Biswabandan and Shukla, Sandeep Kumar},
booktitle={2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR)},
title={AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information},
year={2022},
volume={},
number={},
pages={}
}
Or
[1] S. Kumar, D. Mishra, B. Panda and S. K. Shukla, "DeepDetect: A Practical On-device Android Malware Detector," 2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 2021, pp. 40-51, https://doi.org/10.1109/QRS54544.2021.00015.
[2] S. Kumar, D. Mishra, B. Panda and S. K. Shukla, "AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information," 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022.
Dataset Files
- 16279.csv (2.67 MB)
- 14579.csv (2.49 MB)
- 2018_Q1_1.7z (2.82 GB)
- 2018_Q1_2.7z (2.46 GB)
- 2018_Q1_3.7z (3.71 GB)
- 2018_Q2_4.7z (3.29 GB)
- 2018_Q2_5.7z (5.02 GB)
- 2018_Q2_6.7z (4.47 GB)
- 2018_Q3_7.7z (4.53 GB)
- 2018_Q3_8.7z (4.65 GB)
- 2018_Q3_9.7z (5.83 GB)
- 2018_Q4_10.7z (7.06 GB)
- 2018_Q4_11.7z (12.77 GB)
- 2018_Q4_12.7z (4.99 GB)
- 2019_Q1_1.7z (6.03 GB)
- 2019_Q1_2.7z (6.28 GB)
- 2019_Q1_3.7z (10.31 GB)
- 2019_Q2_4.7z (6.42 GB)
- 2019_Q2_5.7z (4.61 GB)
- 2019_Q2_6.7z (5.70 GB)
- 2019_Q3_7.7z (4.67 GB)
- 2019_Q3_8.7z (5.41 GB)
- 2019_Q3_9.7z (17.12 GB)
- 2019_Q4_10.7z (21.07 GB)
- 2019_Q4_11.7z (15.02 GB)
- 2019_Q4_12.7z (2.73 GB)
- 2020_Q1_1.7z (1.18 GB)
- 2020_Q1_2.7z (413.67 MB)
- 2020_Q1_3.7z (918.62 MB)
- 2020_Q2_4.7z (2.37 GB)
- 2020_Q2_5.7z (3.65 GB)
- 2020_Q2_6.7z (4.46 GB)
- 2020_Q3_7.7z (1.16 GB)
- 2020_Q3_8.7z (1.01 GB)
- 2020_Q3_9.7z (1.15 GB)
- 2020_Q4_10.7z (913.46 MB)
- 2020_Q4_11.7z (617.60 MB)
- 2020_Q4_12.7z (953.47 MB)
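The archive names above follow a `YEAR_Qn_N.7z` pattern. The sketch below splits such a name into its parts, which may help when selecting archives for a given time window. Reading the trailing number as a calendar month is an assumption inferred from the listing (it runs from 1 to 12 and lines up with the quarter); the function names are hypothetical.

```python
import re

# Matches names such as "2019_Q3_9.7z": four-digit year, quarter 1-4, trailing index.
ARCHIVE_RE = re.compile(r"^(\d{4})_Q([1-4])_(\d{1,2})\.7z$")


def parse_archive_name(name):
    """Split a name like '2019_Q3_9.7z' into (year, quarter, index)."""
    match = ARCHIVE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected archive name: {name!r}")
    year, quarter, index = (int(g) for g in match.groups())
    return year, quarter, index


def index_matches_quarter(quarter, index):
    """True if `index`, read as a month number (an assumption), falls in `quarter`."""
    return (index - 1) // 3 + 1 == quarter


names = ["2018_Q1_1.7z", "2019_Q3_9.7z", "2020_Q4_12.7z"]
parsed = [parse_archive_name(n) for n in names]
print(parsed)
```

Filtering `parsed` by year or quarter then gives the subset of archives to download for a time-based train/test split.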