Open Access
1.55M API IMPORT DATASET for MALWARE ANALYSIS
- Submitted by: Quynh Trinh
- Last updated: Fri, 02/25/2022 - 04:07
- DOI: 10.21227/98jc-y909
Abstract
This dataset is part of my Master's research on malware detection and classification using the XGBoost library on Nvidia GPUs. It contains 1.55 million samples, each described by 1,000 API import features extracted from the JSONL files of the EMBER 2017 v2 and EMBER 2018 datasets. All data has been preprocessed, and duplicate records have been removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.
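As a quick sanity check on the counts in the abstract, the sketch below computes the class balance. It also derives the ratio that XGBoost's documentation suggests as a starting value for its `scale_pos_weight` parameter (negative count divided by positive count). This is only an illustration of the arithmetic; whether the original research used this parameter is not stated here.

```python
# Class counts as stated in the abstract.
N_MALWARE = 800_000   # label 1 (positive class)
N_GOODWARE = 750_000  # label 0 (negative class)

total = N_MALWARE + N_GOODWARE
malware_fraction = N_MALWARE / total

# XGBoost's documented heuristic for scale_pos_weight is
# (number of negative samples) / (number of positive samples).
# Using it here is an assumption, not the author's stated configuration.
scale_pos_weight = N_GOODWARE / N_MALWARE

print(f"total samples:    {total:,}")
print(f"malware fraction: {malware_fraction:.4f}")
print(f"scale_pos_weight: {scale_pos_weight:.4f}")
```

The two classes are close to balanced, so the ratio is near 1 and reweighting would likely have little effect.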
* FEATURES *
Column name: sha256
Description: SHA256 hash of the example
Type: string
Column name: appeared
Description: Date the sample appeared
Type: date (yyyy-mm format)
Column name: label
Description: Whether the sample is malware or "goodware"
Type: 0 ("goodware") or 1 (malware)
Column name: GetProcAddress
Description: Most imported function (1st)
Type: 0 (Not imported) or 1 (Imported)
...
Column name: LookupAccountSidW
Description: Least imported function (1000th)
Type: 0 (Not imported) or 1 (Imported)
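The column layout above can be read with a few lines of standard-library Python. The sketch below is a minimal example that assumes the data is a comma-separated file with a header row, which is not confirmed on this page. The sample rows, the shortened hash values, and the helper name `parse_row` are made up for illustration, and only two of the 1,000 API columns are shown.

```python
import csv
import io

# Tiny made-up sample in the layout described above. The real file has
# 1,000 API columns; only the two named in the description are shown here.
# The hash values are shortened placeholders, not real SHA256 digests.
SAMPLE_CSV = """sha256,appeared,label,GetProcAddress,LookupAccountSidW
aaaa,2017-03,1,1,0
bbbb,2018-11,0,1,1
"""

# Columns that are metadata rather than API import flags.
META_COLUMNS = ("sha256", "appeared", "label")


def parse_row(row):
    """Split one CSV row into metadata and the list of imported API names."""
    # "appeared" is in yyyy-mm format.
    year, month = (int(part) for part in row["appeared"].split("-"))
    # An API column holds "1" if the function is imported, "0" otherwise.
    imported = [name for name, value in row.items()
                if name not in META_COLUMNS and value == "1"]
    return {
        "sha256": row["sha256"],
        "appeared": (year, month),
        "is_malware": row["label"] == "1",
        "imported": imported,
    }


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for rec in records:
    print(rec)
```

To read the real data, replace `io.StringIO(SAMPLE_CSV)` with an open file handle, after checking the actual file format and delimiter inside the downloaded archive.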
The full list of feature column names (the dataset header) can be downloaded at https://github.com/tvquynh/api_import_dataset/blob/main/full_dataset_fea...
An alternative download link for the dataset is available at https://drive.google.com/drive/folders/1rXmo01fzWFgnUD0OF2qsphHWwV-1qtLg...
All processing code will be uploaded to https://github.com/tvquynh/api_import_dataset/
Dataset Files
- 1.55 million samples with 1,000 API import features each, extracted from the EMBER 2017 v2 and EMBER 2018 datasets. 1550K MALWARE ANALYSIS DATASETS_API IMPORT.zip (160.17 MB)
Comments
Hi, can I get access to this dataset for my dissertation?
Hi Vivian,
This is free to use and distribute.
Cheers.