This dataset is part of my Master's research on malware detection and classification using the XGBoost library on Nvidia GPUs. It contains 1.55 million samples, each described by 1,000 API import features extracted from the JSONL files of the EMBER 2017 v2 and EMBER 2018 datasets. All data has been pre-processed, and duplicate records have been removed. The dataset contains 800,000 malware and 750,000 "goodware" samples.
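As a rough illustration of how binary API import features can be derived from JSONL records, the sketch below counts how many samples import each function, keeps the k most frequent ones, and encodes every sample as a 0/1 vector. It is not the processing code used for this dataset. It assumes that each record carries an "imports" field mapping a library name to a list of imported function names; the helper names (imported_functions, top_functions, encode) and the three toy records are hypothetical.

```python
import json
from collections import Counter


def imported_functions(record):
    """Return the set of function names imported by one sample."""
    # Assumption: record["imports"] maps a library name to a list of function names.
    names = set()
    for functions in record.get("imports", {}).values():
        names.update(functions)
    return names


def top_functions(records, k):
    """Rank functions by the number of samples importing them and keep the top k."""
    counts = Counter()
    for record in records:
        # Updating with a set counts each function at most once per sample.
        counts.update(imported_functions(record))
    return [name for name, _ in counts.most_common(k)]


def encode(record, vocabulary):
    """Encode one sample as 1 (imported) or 0 (not imported) per vocabulary entry."""
    imported = imported_functions(record)
    return [1 if name in imported else 0 for name in vocabulary]


# Toy JSONL lines, made up for illustration only (not real samples).
lines = [
    '{"sha256": "aa", "label": 1, "imports": {"KERNEL32.dll": ["GetProcAddress", "LoadLibraryA"]}}',
    '{"sha256": "bb", "label": 0, "imports": {"KERNEL32.dll": ["GetProcAddress"], "ADVAPI32.dll": ["LookupAccountSidW"]}}',
    '{"sha256": "cc", "label": 0, "imports": {"KERNEL32.dll": ["GetProcAddress", "LoadLibraryA"]}}',
]
records = [json.loads(line) for line in lines]

vocabulary = top_functions(records, k=2)  # the dataset itself uses 1,000 functions
rows = [encode(record, vocabulary) for record in records]
```

With k set to 1,000 and the full set of records, the same idea would give one row of 1,000 binary values per sample, ordered from the most to the least frequently imported function.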



Column name: sha256

Description: SHA-256 hash of the sample

Type: string


Column name: appeared

Description: Date on which the sample appeared

Type: date (yyyy-mm format)


Column name: label

Description: Whether the sample is malware or "goodware"

Type: 0 ("goodware") or 1 (malware)


Column name: GetProcAddress

Description: Most frequently imported function (1st)

Type: 0 (Not imported) or 1 (Imported)



Column name: LookupAccountSidW

Description: Least frequently imported function (1000th)

Type: 0 (Not imported) or 1 (Imported)
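
The column layout above can be read with a few lines of standard-library Python. The sketch below is only an illustration under stated assumptions: it assumes the data is available as CSV text with a header row (the distribution format is not stated here), the parse_row helper is hypothetical, and the two sample rows use placeholder hashes and made-up values rather than real records.

```python
import csv
import io
from datetime import datetime

# Columns that are not API import features, per the descriptions above.
META_COLUMNS = ("sha256", "appeared", "label")


def parse_row(row):
    """Split one CSV row into hash, appearance month, label, and feature flags."""
    appeared = datetime.strptime(row["appeared"], "%Y-%m")  # yyyy-mm format
    label = int(row["label"])  # 0 = "goodware", 1 = malware
    features = {
        name: int(value)  # 0 = not imported, 1 = imported
        for name, value in row.items()
        if name not in META_COLUMNS
    }
    return row["sha256"], appeared, label, features


# Made-up rows with only two of the 1,000 feature columns, for illustration.
sample_csv = (
    "sha256,appeared,label,GetProcAddress,LookupAccountSidW\n"
    "placeholder-hash-1,2018-01,1,1,0\n"
    "placeholder-hash-2,2017-06,0,1,1\n"
)

parsed = [parse_row(row) for row in csv.DictReader(io.StringIO(sample_csv))]
```

Each parsed entry keeps the three metadata fields separate from the feature flags, so the flags can be passed to a classifier while the label serves as the target.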


The full feature header of the dataset can be downloaded at

All processing code will be uploaded to


The data uploaded here shall support the paper 

Decision Tree Analysis of  ...

which has been submitted to IEEE Transactions on Medical Imaging (2020, September 25) by the authors

Julian Mattes, Wolfgang Fenz, Stefan Thumfart, Gerhard Haitchi, Pierre Schmit, Franz A. Fellner

During review, the data shall be visible only to the reviewers of this paper. Afterwards, this abstract will be modified and complemented, and a dataset image will be uploaded.