Troid: Temporal and Cross-Sectional Android Dataset and Its Applications

Citation Author(s): Ali Al Kinoon, Abdulaziz Alghamdi, Ahod Alghuried, David Mohaisen
Submitted by: Ali Al Kinoon
Last updated: Wed, 11/20/2024 - 07:00
DOI: 10.21227/95my-tf46

Abstract 

Numerous studies have explored Android malware in recent years, covering areas such as malware detection and application analysis. As a result, there is a pressing need for a reliable and scalable malware dataset to support the development and evaluation of effective malware studies. Although several benchmark Android malware datasets are widely used in research, they have significant limitations. First, many of these datasets are outdated and do not capture current malware trends, and some have become obsolete or inaccessible, limiting their usefulness. Second, most datasets contain only the apps themselves (APKs) and lack important meta features such as content rating, ad coverage, user ratings, and privacy policies. This omission restricts the potential applications of these datasets. This paper introduces Troid, a reliable Android malware dataset sourced from the Google Play Store and covering the period from 2019 to 2023. To label malicious apps, we use VirusTotal and track their availability and removal status on the Google Play Store. Using this method, we meticulously curate an Android malware dataset with 5,028 samples. We augment our dataset with various features, including privacy policies, metadata, control flow graphs, permissions, API calls, strings, function names, hex dumps, and labels. We believe this benchmark dataset will greatly support various research efforts, including Android malware classification and detection, static program analysis, and privacy policy analysis.

Instructions: 

Our dataset comprises 5,028 APKs with 8 unique features, spanning five years from 2019 to 2023 and covering 16 genres, including finance, sports, and games. The dataset is distributed in compressed form. After decompressing it, you will find each feature titled and annotated.

Within each feature folder, there are 5 subfolders, one for each year from 2019 to 2023. Inside each yearly folder, you will find 16 genre folders. For example, the sports genre folder contains the data for benign apps, plus a subfolder titled 'malicious' with the data for malicious apps.

For the apps themselves, the structure is similar: 5 folders representing each year from 2019 to 2023. Each yearly folder contains 16 genre categories, with each genre folder including benign apps and an additional folder for malicious apps.
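The layout described above can be traversed programmatically. The following is a minimal Python sketch, not an official loader: it assumes the yearly folders are named by their four-digit year (e.g. '2019'), that benign files sit directly inside each genre folder, and that the malicious subfolder is named exactly 'malicious'. The function name index_samples and the example path are hypothetical; adjust them to the names you find after decompressing.

```python
from pathlib import Path

# Assumption: yearly folders are named "2019" ... "2023".
YEARS = {str(year) for year in range(2019, 2024)}


def index_samples(root):
    """Return (year, genre, label, path) tuples for files under root/<year>/<genre>/.

    Files inside a 'malicious' subfolder of a genre folder are labeled
    'malicious'; every other file in the genre folder is labeled 'benign'.
    """
    root = Path(root)
    records = []
    year_dirs = sorted(p for p in root.iterdir() if p.is_dir() and p.name in YEARS)
    for year_dir in year_dirs:
        genre_dirs = sorted(p for p in year_dir.iterdir() if p.is_dir())
        for genre_dir in genre_dirs:
            for path in sorted(genre_dir.rglob("*")):
                if not path.is_file():
                    continue
                rel = path.relative_to(genre_dir)
                # Assumption: the malicious subfolder is named 'malicious'.
                is_malicious = len(rel.parts) > 1 and rel.parts[0] == "malicious"
                label = "malicious" if is_malicious else "benign"
                records.append((year_dir.name, genre_dir.name, label, path))
    return records
```

Calling index_samples on a decompressed feature folder or on the apps folder (for example, a hypothetical path such as "Troid/permissions") returns one record per file, which can then be grouped by year, genre, or label.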

Please note that one of the eight features, the 'control flow graph,' is too large across the five-year span to fit within DataPort's upload limit. If you are interested in this feature, you can request it by email at alialkinoon@ucf.edu. For any questions or concerns regarding the dataset, feel free to reach out to the same email.