Open Access

1.55M API IMPORT DATASET for MALWARE ANALYSIS

Citation Author(s):
Quynh Trinh
Submitted by:
Quynh Trinh
DOI:
10.21227/98jc-y909

Abstract

This dataset is part of my Master's research on malware detection and classification using the XGBoost library on Nvidia GPUs. It contains 1.55 million samples, each described by 1,000 API import features extracted from the JSONL files of the EMBER 2017 v2 and EMBER 2018 datasets. All data has been pre-processed and duplicate records have been removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.
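The extraction described in the abstract can be illustrated with a minimal sketch. It assumes each EMBER JSONL line holds an "imports" object mapping a library name to the list of functions imported from it, and it uses a hypothetical two-name vocabulary in place of the real 1,000 API names. The sample record is made up, and this is not the author's processing code.

```python
import json

# Hypothetical vocabulary; the real dataset uses 1000 API names.
API_NAMES = ["GetProcAddress", "LookupAccountSidW"]

def import_features(record, api_names):
    """Return a 0/1 list: 1 if the sample imports the API name, else 0."""
    imported = set()
    # Assumed layout: {"KERNEL32.dll": ["GetProcAddress", ...], ...}
    for functions in record.get("imports", {}).values():
        imported.update(functions)
    return [1 if name in imported else 0 for name in api_names]

# One made-up JSONL line standing in for an EMBER record.
line = json.dumps({
    "sha256": "aaaa",
    "appeared": "2017-01",
    "label": 1,
    "imports": {"KERNEL32.dll": ["GetProcAddress", "LoadLibraryA"]},
})

record = json.loads(line)
row = [record["sha256"], record["appeared"], record["label"]]
row += import_features(record, API_NAMES)
print(row)
```

Each output row then follows the column layout of the dataset: sha256, appeared, label, followed by one 0/1 flag per API name.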

Instructions:

* FEATURES *

Column name: sha256
Description: SHA256 hash of the sample
Type: string

Column name: appeared
Description: Date the sample appeared
Type: date (yyyy-mm format)

Column name: label
Description: Whether the sample is malware or "goodware"
Type: 0 ("goodware") or 1 (malware)

Column name: GetProcAddress
Description: Most frequently imported function (1st)
Type: 0 (not imported) or 1 (imported)

...

Column name: LookupAccountSidW
Description: Least frequently imported of the 1000 selected functions (1000th)
Type: 0 (not imported) or 1 (imported)

The full list of feature column names (the dataset header) can be downloaded from https://github.com/tvquynh/api_import_dataset/blob/main/full_dataset_features.csv

An alternative download link for the dataset is https://drive.google.com/drive/folders/1rXmo01fzWFgnUD0OF2qsphHWwV-1qtLg?usp=sharing

All processing code will be uploaded to https://github.com/tvquynh/api_import_dataset/
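
The column layout described above can be read with a short sketch like the following. It assumes the data is distributed as CSV with a header row (sha256, appeared, label, then the API columns). The in-memory sample and its values are made up for illustration, and load_rows is a hypothetical helper, not part of the published code.

```python
import csv
import io

# Tiny in-memory stand-in for the dataset file; values are invented.
SAMPLE_CSV = """sha256,appeared,label,GetProcAddress,LookupAccountSidW
aaaa,2017-01,0,1,0
bbbb,2018-03,1,1,1
"""

META_COLUMNS = ("sha256", "appeared", "label")

def load_rows(handle):
    """Split rows into hashes, labels and 0/1 feature vectors."""
    reader = csv.DictReader(handle)
    # Every column that is not metadata is treated as an API import flag.
    feature_names = [c for c in reader.fieldnames if c not in META_COLUMNS]
    hashes, labels, features = [], [], []
    for row in reader:
        hashes.append(row["sha256"])
        labels.append(int(row["label"]))
        features.append([int(row[name]) for name in feature_names])
    return feature_names, hashes, labels, features

feature_names, hashes, labels, features = load_rows(io.StringIO(SAMPLE_CSV))
print(feature_names, labels, features)
```

For the real file, the same function could be given an open file handle instead of the StringIO object. With 1.55 million rows, a columnar or array-based loader would likely be preferable to Python lists.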