Open Access
1.55M API IMPORT DATASET for MALWARE ANALYSIS
- Submitted by: Quynh Trinh
- Last updated: Tue, 02/23/2021 - 04:29
- DOI: 10.21227/98jc-y909
Abstract
This dataset is part of my Master's research on malware detection and classification using the XGBoost library on an Nvidia GPU. It contains 1.55 million samples, each described by 1,000 API import features extracted from the JSONL files of the EMBER 2017 v2 and EMBER 2018 datasets. All data has been pre-processed, and duplicate records have been removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.
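The extraction step described above can be illustrated with a minimal sketch. It is not the author's processing code. It assumes the EMBER raw-feature layout, in which each JSONL record carries `sha256`, `appeared`, `label`, and an `imports` object mapping library names to lists of imported function names, and in which unlabeled samples have label -1. The helper names (`imported_functions`, `top_functions`, `encode`) and the three sample records are hypothetical, and the hashes are shortened placeholders.

```python
import json
from collections import Counter


def imported_functions(record):
    """Return the set of function names imported by one EMBER-style record."""
    imports = record.get("imports", {})
    return {func for funcs in imports.values() for func in funcs}


def top_functions(records, k):
    """Rank function names by the number of samples importing them; keep k."""
    counts = Counter()
    for record in records:
        counts.update(imported_functions(record))
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


def encode(record, vocabulary):
    """Build one row: identifiers, label, then a 0/1 flag per function."""
    present = imported_functions(record)
    row = {
        "sha256": record["sha256"],
        "appeared": record["appeared"],
        "label": record["label"],
    }
    for name in vocabulary:
        row[name] = 1 if name in present else 0
    return row


# Hypothetical records standing in for lines of an EMBER JSONL file.
jsonl_lines = [
    '{"sha256": "aa", "appeared": "2017-01", "label": 1, '
    '"imports": {"KERNEL32.dll": ["GetProcAddress", "LoadLibraryA"]}}',
    '{"sha256": "bb", "appeared": "2018-03", "label": 0, '
    '"imports": {"KERNEL32.dll": ["GetProcAddress"], '
    '"ADVAPI32.dll": ["LookupAccountSidW"]}}',
    '{"sha256": "cc", "appeared": "2018-05", "label": -1, "imports": {}}',
]

records = [json.loads(line) for line in jsonl_lines]
# Assumption: only samples labeled 0 or 1 are kept; unlabeled ones are dropped.
labeled = [r for r in records if r["label"] in (0, 1)]
vocabulary = top_functions(labeled, k=2)  # the dataset itself uses 1,000
rows = [encode(r, vocabulary) for r in labeled]
```

In this sketch a function counts once per sample regardless of which library exports it, which matches column names such as `GetProcAddress` that carry no library prefix. Whether the original ranking was computed this way is an assumption.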
* FEATURES *
Column name: sha256
Description: SHA256 hash of the sample
Type: string
Column name: appeared
Description: Date the sample appeared
Type: date (yyyy-mm format)
Column name: label
Description: Whether the sample is malware or "goodware"
Type: 0 ("goodware") or 1 (malware)
Column name: GetProcAddress
Description: Most imported function (1st)
Type: 0 (Not imported) or 1 (Imported)
...
Column name: LookupAccountSidW
Description: Least imported function (1000th)
Type: 0 (Not imported) or 1 (Imported)
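The column layout above can be checked while reading the data. The sketch below assumes the dataset is distributed as CSV with a header row; the listing does not state the data format, so this is an assumption. The names `parse_row` and `summarize` are hypothetical, and the two-feature sample with shortened hashes stands in for the real 1,000-feature file.

```python
import csv
import io
import re

META_COLUMNS = ("sha256", "appeared", "label")
APPEARED_RE = re.compile(r"^\d{4}-\d{2}$")  # yyyy-mm


def parse_row(row):
    """Validate one CSV row against the documented column types."""
    if not APPEARED_RE.match(row["appeared"]):
        raise ValueError(f"bad appeared value: {row['appeared']!r}")
    label = int(row["label"])
    if label not in (0, 1):
        raise ValueError(f"bad label: {label}")
    features = {}
    for name, value in row.items():
        if name in META_COLUMNS:
            continue
        flag = int(value)
        if flag not in (0, 1):
            raise ValueError(f"bad flag for {name}: {flag}")
        features[name] = flag
    return row["sha256"], row["appeared"], label, features


def summarize(lines):
    """Count "goodware" (0) and malware (1) rows in a CSV stream."""
    counts = {0: 0, 1: 0}
    for row in csv.DictReader(lines):
        _, _, label, _ = parse_row(row)
        counts[label] += 1
    return counts


# Hypothetical two-row sample; the real file has 1,000 feature columns.
sample = io.StringIO(
    "sha256,appeared,label,GetProcAddress,LookupAccountSidW\n"
    "aa,2017-01,1,1,0\n"
    "bb,2018-03,0,1,1\n"
)
counts = summarize(sample)
```

Run over the full file, `summarize` should report 750,000 rows with label 0 and 800,000 with label 1 if the file matches the abstract.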
The full feature header of the dataset can be downloaded at https://github.com/tvquynh/api_import_dataset/blob/main/full_dataset_fea...
All processing code will be uploaded to https://github.com/tvquynh/api_import_dataset/