This dataset is part of my Master's research on malware detection and classification using the XGBoost library on Nvidia GPU. The dataset is a collection of 1.55 million of 1000 API import features extract from jsonl format of the EMBER dataset 2017 v2 and 2018. All data is pre-processing, duplicated records are removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.

Instructions: 

* FEATURES *

Column name:  sha256

Description: SHA256 hash of the example

Type: string

 

Column name:  appeared

Description: appeared date of the sample

Type: date (yyyy-mm format)

 

Column name:  label

Description: specify malware or "goodware" of the sample

Type: 0 ("goodware") or 1 (malware)

 

Column name: GetProcAddress

Description: Most imported function (1st)

Type: 0 (Not imported) or 1 (Imported)

 

...

Column name: LookupAccountSidW

Description: Least imported function (1000th)

Type: 0 (Not imported) or 1 (Imported)

 

The full dataset features header can be downloaded at https://github.com/tvquynh/api_import_dataset/blob/main/full_dataset_fea...

All processing code will be uploaded to https://github.com/tvquynh/api_import_dataset/

Categories:
12348 Views