Dataset of dynamic malware detection based on enhanced semantic API sequence features

Citation Author(s):
Lei Zhou
Submitted by:
Lei Zhou
Last updated:
Tue, 05/21/2024 - 07:14
DOI:
10.21227/2vr4-n584

Abstract 

Dynamic malware detection assesses whether an executable program exhibits malicious behavior by analyzing its dynamic (run-time) features. However, many current methods under-explore the semantic features of API sequences and instead rely on mining parameter information from API calls to improve detection performance. This leads to excessive dependence on prior knowledge, larger models, and higher computational complexity. To address this, this paper proposes a dynamic malware detection scheme based on enhanced semantic API sequence features, which integrates the RoBERTa pre-trained model with a gating mechanism. The scheme relies solely on API call sequences, which capture the contextual semantic information implicitly embedded in the execution of an executable file. Dynamically adjusting the weights of the different modal features within the model increases its sensitivity to different malware samples. By fusing multimodal features, the approach captures both the semantic and global characteristics of API sequences, enabling the model to adapt more flexibly to malware variants and thereby improving detection accuracy and robustness. Experimental results show that the proposed approach achieves classification accuracy exceeding 99% across multiple publicly available datasets.

Instructions: 

We used the dynamic dataset from Datacon to validate our proposed approach. This dataset records the behavior of executable programs in a sandbox environment and comprises 85,000 reports on malicious samples and 20,000 reports on benign samples. To balance malicious and benign samples in our experiments, we randomly selected 20,000 malicious reports and 20,000 benign reports from the dynamic behavior reports. These reports were then divided into two datasets, TrainDataset and TestDataset, each consisting of 10,000 malicious reports and 10,000 benign reports.
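The selection and split described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `balanced_split`, the fixed seed, and the even per-class split are assumptions made for the example.

```python
import random


def balanced_split(malicious, benign, per_class=20000, seed=0):
    """Sample `per_class` reports from each class, then split each class in half.

    `malicious` and `benign` are lists of reports (any Python objects).
    The seed and the half/half split are assumptions for illustration.
    """
    rng = random.Random(seed)
    mal = rng.sample(malicious, per_class)  # sampling without replacement
    ben = rng.sample(benign, per_class)
    half = per_class // 2
    train = mal[:half] + ben[:half]
    test = mal[half:] + ben[half:]
    rng.shuffle(train)  # avoid all malicious reports preceding benign ones
    rng.shuffle(test)
    return train, test


# Toy usage with placeholder reports standing in for the sandbox reports.
toy_malicious = [("malicious", i) for i in range(85)]
toy_benign = [("benign", i) for i in range(20)]
toy_train, toy_test = balanced_split(toy_malicious, toy_benign, per_class=20, seed=1)
```

With the full counts given above, `per_class=20000` would yield two sets of 10,000 malicious and 10,000 benign reports each.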

To assess the generalization capability of our proposed approach, we also used the publicly available Ki and Catak datasets from the study cited below. The Ki dataset comprises 23,146 malicious samples and 21,116 benign samples, while the Catak dataset consists of 7,107 malicious samples and 169 benign samples.
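The class balance implied by these counts differs sharply between datasets, which is worth keeping in mind when reading accuracy figures. The arithmetic can be sketched as below; the helper name `malicious_fraction` is hypothetical, and the counts are taken from the description above.

```python
def malicious_fraction(n_malicious, n_benign):
    """Return the share of malicious samples in a dataset."""
    return n_malicious / (n_malicious + n_benign)


# (malicious, benign) sample counts as stated in the dataset description.
counts = {
    "Datacon (balanced subset)": (20000, 20000),
    "Ki": (23146, 21116),
    "Catak": (7107, 169),
}

fractions = {name: malicious_fraction(m, b) for name, (m, b) in counts.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} malicious")
```

Ki is close to balanced, while Catak is roughly 97.7% malicious, so a classifier that always predicts "malicious" would already reach about that accuracy on Catak.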

The Datacon dataset, from https://github.com/kericwy1337, contains TrainDataset (foldtrain_datacon.json) and TestDataset (foldtest_datacon.json).

The Ki (foldtest_ki.json) and Catak (foldtest_catak.json) datasets come from the article: E. Amer and I. Zelinka, “A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence,” Computers & Security, vol. 92, p. 101760, 2020.

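The datasets above store API call sequences in JSON format. "label" is the class tag, with 0 indicating benign software and 1 indicating malicious software (stored as a string in the example below). "text" is the sequence of dynamic API calls made by the executable program, given as space-separated API names. An example record: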

 

{"label": "0", "text": "ntqueryattributesfile ntcreatefile ntwaitforsingleobject ntqueryinformationfile ntwaitforsingleobject ntclose ntwaitforsingleobject ntunmapviewofsection ntmapviewofsection ntwaitforsingleobject ntunmapviewofsection ntmapviewofsection ntopenfile ntwaitforsingleobject ntclose ntunmapviewofsection ntterminatethread ntwaitforsingleobject ntclose ntterminateprocess ntclose ntopenkey ntqueryvaluekey ntclose ntwaitforsingleobject ntclose ntterminateprocess"}
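A record like the one above can be parsed as sketched below. The sketch assumes each record is a standalone JSON object (for example, one per line). The description does not state the overall file layout, so that is an assumption, and the helper name `parse_record` is hypothetical. The sample string is a shortened version of the record shown above.

```python
import json
from collections import Counter


def parse_record(line):
    """Parse one JSON record into (label, api_calls).

    The label is stored as a string ("0" or "1") and is converted to int;
    the API sequence is split on whitespace into individual call names.
    """
    record = json.loads(line)
    label = int(record["label"])
    api_calls = record["text"].split()
    return label, api_calls


# Shortened sample record for illustration.
sample = (
    '{"label": "0", "text": "ntqueryattributesfile ntcreatefile '
    'ntwaitforsingleobject ntclose"}'
)

label, api_calls = parse_record(sample)
call_counts = Counter(api_calls)  # frequency of each API name in the sequence
```

If the files turn out to hold a single JSON array instead, loading the whole file with `json.load` and iterating over the resulting list would be the equivalent starting point.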