1.55M API IMPORT DATASET for MALWARE ANALYSIS


Abstract 

This dataset is part of my Master's research on malware detection and classification using the XGBoost library on Nvidia GPUs. It contains 1.55 million samples, each described by 1,000 API import features extracted from the JSONL files of the EMBER 2017 v2 and EMBER 2018 datasets. All data has been pre-processed, and duplicate records have been removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.

Instructions: 

* FEATURES *

Column name:  sha256

Description: SHA256 hash of the sample

Type: string

 

Column name:  appeared

Description: Date the sample appeared

Type: date (yyyy-mm format)

 

Column name:  label

Description: Whether the sample is malware or "goodware"

Type: 0 ("goodware") or 1 (malware)

 

Column name: GetProcAddress

Description: Most frequently imported function (ranked 1st)

Type: 0 (Not imported) or 1 (Imported)

 

...

Column name: LookupAccountSidW

Description: Least frequently imported function (ranked 1000th)

Type: 0 (Not imported) or 1 (Imported)
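
The column layout above can be read with a few lines of Python. The following is only a minimal sketch, not the author's processing code. It assumes the data is distributed as CSV with a header row, which is not stated on this page. The function names `read_samples` and `split_by_month`, the `2018-01` cutoff, and the three demo rows are all made up for illustration. It separates the three metadata columns from the 0/1 API import columns and splits samples chronologically on the `appeared` month.

```python
import csv
import io

# Metadata columns described above; every other column is an API import flag.
META_COLUMNS = ("sha256", "appeared", "label")


def read_samples(fileobj):
    """Yield (sha256, appeared, label, features) for each CSV row.

    `features` is a list of 0/1 ints in header order, one per API column.
    Assumes a CSV file with a header row (an assumption, not a documented fact).
    """
    reader = csv.DictReader(fileobj)
    api_columns = [c for c in reader.fieldnames if c not in META_COLUMNS]
    for row in reader:
        features = [int(row[c]) for c in api_columns]
        yield row["sha256"], row["appeared"], int(row["label"]), features


def split_by_month(samples, cutoff):
    """Split samples into (before, on_or_after) a yyyy-mm cutoff.

    yyyy-mm strings sort chronologically, so plain string comparison works.
    """
    before, after = [], []
    for sample in samples:
        (before if sample[1] < cutoff else after).append(sample)
    return before, after


# Tiny made-up excerpt with only two of the 1000 API columns.
DEMO_CSV = """sha256,appeared,label,GetProcAddress,LookupAccountSidW
aaa,2017-03,0,1,0
bbb,2018-01,1,1,1
ccc,2018-06,1,0,0
"""

samples = list(read_samples(io.StringIO(DEMO_CSV)))
train, test = split_by_month(samples, "2018-01")
print(len(train), len(test))
```

For the real file, `io.StringIO(DEMO_CSV)` would be replaced by an open file handle. At 1.55 million rows by 1000 columns, a streaming reader like this one avoids holding the raw text in memory, although the parsed feature lists would still need a compact array type before training.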

 

The full list of feature column names (the dataset header) can be downloaded at https://github.com/tvquynh/api_import_dataset/blob/main/full_dataset_fea...

An alternative download link for the dataset is available at https://drive.google.com/drive/folders/1rXmo01fzWFgnUD0OF2qsphHWwV-1qtLg...

All processing code will be uploaded to https://github.com/tvquynh/api_import_dataset/

Comments

Hi, can I get access to this dataset for my dissertation?

Submitted by Vivian Omo-Ojugo on Sun, 02/20/2022 - 21:53

Hi Vivian,
This dataset is free to use and distribute.
Cheers.

Submitted by Quynh Trinh on Thu, 02/24/2022 - 13:51