This dataset is part of our research on malware detection and classification using Deep Learning. It contains 42,797 malware API call sequences and 1,079 goodware API call sequences. Each API call sequence is composed of the first 100 non-repeated consecutive API calls associated with the parent process, extracted from the 'calls' elements of Cuckoo Sandbox reports.
* FEATURES *
Column name: hash
Description: MD5 hash of the example
Type: 32-byte string
Column name: t_0 ... t_99
Description: API call
Type: Integer (0-306)
Column name: malware
Type: Integer: 0 (Goodware) or 1 (Malware)
API Calls: ['NtOpenThread', 'ExitWindowsEx', 'FindResourceW', 'CryptExportKey', 'CreateRemoteThreadEx', 'MessageBoxTimeoutW', 'InternetCrackUrlW', 'StartServiceW', 'GetFileSize', 'GetVolumeNameForVolumeMountPointW', 'GetFileInformationByHandle', 'CryptAcquireContextW', 'RtlDecompressBuffer', 'SetWindowsHookExA', 'RegSetValueExW', 'LookupAccountSidW', 'SetUnhandledExceptionFilter', 'InternetConnectA', 'GetComputerNameW', 'RegEnumValueA', 'NtOpenFile', 'NtSaveKeyEx', 'HttpOpenRequestA', 'recv', 'GetFileSizeEx', 'LoadStringW', 'SetInformationJobObject', 'WSAConnect', 'CryptDecrypt', 'GetTimeZoneInformation', 'InternetOpenW', 'CoInitializeEx', 'CryptGenKey', 'GetAsyncKeyState', 'NtQueryInformationFile', 'GetSystemMetrics', 'NtDeleteValueKey', 'NtOpenKeyEx', 'sendto', 'IsDebuggerPresent', 'RegQueryInfoKeyW', 'NetShareEnum', 'InternetOpenUrlW', 'WSASocketA', 'CopyFileExW', 'connect', 'ShellExecuteExW', 'SearchPathW', 'GetUserNameA', 'InternetOpenUrlA', 'LdrUnloadDll', 'EnumServicesStatusW', 'EnumServicesStatusA', 'WSASend', 'CopyFileW', 'NtDeleteFile', 'CreateActCtxW', 'timeGetTime', 'MessageBoxTimeoutA', 'CreateServiceA', 'FindResourceExW', 'WSAAccept', 'InternetConnectW', 'HttpSendRequestA', 'GetVolumePathNameW', 'RegCloseKey', 'InternetGetConnectedStateExW', 'GetAdaptersInfo', 'shutdown', 'NtQueryMultipleValueKey', 'NtQueryKey', 'GetSystemWindowsDirectoryW', 'GlobalMemoryStatusEx', 'GetFileAttributesExW', 'OpenServiceW', 'getsockname', 'LoadStringA', 'UnhookWindowsHookEx', 'NtCreateUserProcess', 'Process32NextW', 'CreateThread', 'LoadResource', 'GetSystemTimeAsFileTime', 'SetStdHandle', 'CoCreateInstanceEx', 'GetSystemDirectoryA', 'NtCreateMutant', 'RegCreateKeyExW', 'IWbemServices_ExecQuery', 'NtDuplicateObject', 'Thread32First', 'OpenSCManagerW', 'CreateServiceW', 'GetFileType', 'MoveFileWithProgressW', 'NtDeviceIoControlFile', 'GetFileInformationByHandleEx', 'CopyFileA', 'NtLoadKey', 'GetNativeSystemInfo', 'NtOpenProcess', 'CryptUnprotectMemory', 
'InternetWriteFile', 'ReadProcessMemory', 'gethostbyname', 'WSASendTo', 'NtOpenSection', 'listen', 'WSAStartup', 'socket', 'OleInitialize', 'FindResourceA', 'RegOpenKeyExA', 'RegEnumKeyExA', 'NtQueryDirectoryFile', 'CertOpenSystemStoreW', 'ControlService', 'LdrGetProcedureAddress', 'GlobalMemoryStatus', 'NtSetInformationFile', 'OutputDebugStringA', 'GetAdaptersAddresses', 'CoInitializeSecurity', 'RegQueryValueExA', 'NtQueryFullAttributesFile', 'DeviceIoControl', '__anomaly__', 'DeleteFileW', 'GetShortPathNameW', 'NtGetContextThread', 'GetKeyboardState', 'RemoveDirectoryA', 'InternetSetStatusCallback', 'NtResumeThread', 'SetFileInformationByHandle', 'NtCreateSection', 'NtQueueApcThread', 'accept', 'DecryptMessage', 'GetUserNameExW', 'SizeofResource', 'RegQueryValueExW', 'SetWindowsHookExW', 'HttpOpenRequestW', 'CreateDirectoryW', 'InternetOpenA', 'GetFileVersionInfoExW', 'FindWindowA', 'closesocket', 'RtlAddVectoredExceptionHandler', 'IWbemServices_ExecMethod', 'GetDiskFreeSpaceExW', 'TaskDialog', 'WriteConsoleW', 'CryptEncrypt', 'WSARecvFrom', 'NtOpenMutant', 'CoGetClassObject', 'NtQueryValueKey', 'NtDelayExecution', 'select', 'HttpQueryInfoA', 'GetVolumePathNamesForVolumeNameW', 'RegDeleteValueW', 'InternetCrackUrlA', 'OpenServiceA', 'InternetSetOptionA', 'CreateDirectoryExW', 'bind', 'NtShutdownSystem', 'DeleteUrlCacheEntryA', 'NtMapViewOfSection', 'LdrGetDllHandle', 'NtCreateKey', 'GetKeyState', 'CreateRemoteThread', 'NtEnumerateValueKey', 'SetFileAttributesW', 'NtUnmapViewOfSection', 'RegDeleteValueA', 'CreateJobObjectW', 'send', 'NtDeleteKey', 'SetEndOfFile', 'GetUserNameExA', 'GetComputerNameA', 'URLDownloadToFileW', 'NtFreeVirtualMemory', 'recvfrom', 'NtUnloadDriver', 'NtTerminateThread', 'CryptUnprotectData', 'NtCreateThreadEx', 'DeleteService', 'GetFileAttributesW', 'GetFileVersionInfoSizeExW', 'OpenSCManagerA', 'WriteProcessMemory', 'GetSystemInfo', 'SetFilePointer', 'Module32FirstW', 'ioctlsocket', 'RegEnumKeyW', 'RtlCompressBuffer', 
'SendNotifyMessageW', 'GetAddrInfoW', 'CryptProtectData', 'Thread32Next', 'NtAllocateVirtualMemory', 'RegEnumKeyExW', 'RegSetValueExA', 'DrawTextExA', 'CreateToolhelp32Snapshot', 'FindWindowW', 'CoUninitialize', 'NtClose', 'WSARecv', 'CertOpenStore', 'InternetGetConnectedState', 'RtlAddVectoredContinueHandler', 'RegDeleteKeyW', 'SHGetSpecialFolderLocation', 'CreateProcessInternalW', 'NtCreateDirectoryObject', 'EnumWindows', 'DrawTextExW', 'RegEnumValueW', 'SendNotifyMessageA', 'NtProtectVirtualMemory', 'NetUserGetLocalGroups', 'GetUserNameW', 'WSASocketW', 'getaddrinfo', 'AssignProcessToJobObject', 'SetFileTime', 'WriteConsoleA', 'CryptDecodeObjectEx', 'EncryptMessage', 'system', 'NtSetContextThread', 'LdrLoadDll', 'InternetGetConnectedStateExA', 'RtlCreateUserThread', 'GetCursorPos', 'Module32NextW', 'RegCreateKeyExA', 'NtLoadDriver', 'NetUserGetInfo', 'SHGetFolderPathW', 'GetBestInterfaceEx', 'CertControlStore', 'StartServiceA', 'NtWriteFile', 'Process32FirstW', 'NtReadVirtualMemory', 'GetDiskFreeSpaceW', 'GetFileVersionInfoW', 'FindFirstFileExW', 'FindWindowExW', 'GetSystemWindowsDirectoryA', 'RegOpenKeyExW', 'CoCreateInstance', 'NtQuerySystemInformation', 'LookupPrivilegeValueW', 'NtReadFile', 'ReadCabinetState', 'GetForegroundWindow', 'InternetCloseHandle', 'FindWindowExA', 'ObtainUserAgentString', 'CryptCreateHash', 'GetTempPathW', 'CryptProtectMemory', 'NetGetJoinInformation', 'NtOpenKey', 'GetSystemDirectoryW', 'DnsQuery_A', 'RegQueryInfoKeyA', 'NtEnumerateKey', 'RegisterHotKey', 'RemoveDirectoryW', 'FindFirstFileExA', 'CertOpenSystemStoreA', 'NtTerminateProcess', 'NtSetValueKey', 'CryptAcquireContextA', 'SetErrorMode', 'UuidCreate', 'RtlRemoveVectoredExceptionHandler', 'RegDeleteKeyA', 'setsockopt', 'FindResourceExA', 'NtSuspendThread', 'GetFileVersionInfoSizeW', 'NtOpenDirectoryObject', 'InternetQueryOptionA', 'InternetReadFile', 'NtCreateFile', 'NtQueryAttributesFile', 'HttpSendRequestW', 'CryptHashMessage', 'CryptHashData', 'NtWriteVirtualMemory', 
'SetFilePointerEx', 'CertCreateCertificateContext', 'DeleteUrlCacheEntryW', '__exception__']
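As a minimal sketch, the integer codes in columns t_0 ... t_99 can be mapped back to API names. This assumes that code i refers to the i-th entry (0-based) of the API call list above, which the description implies but does not state outright. The helper name decode_sequence and the sample codes are hypothetical, and the list is truncated to its first five names for brevity.

```python
# Assumption: integer code i is the 0-based position of the name in the API call list.
# Only the first five names are reproduced; use the full 307-entry list in practice.
API_CALLS = [
    'NtOpenThread', 'ExitWindowsEx', 'FindResourceW',
    'CryptExportKey', 'CreateRemoteThreadEx',
]

def decode_sequence(codes, names=API_CALLS):
    """Map integer API-call codes to names; codes outside the list become None."""
    return [names[c] if 0 <= c < len(names) else None for c in codes]

sample_codes = [2, 0, 4, 1]  # hypothetical values, not taken from the dataset
decoded = decode_sequence(sample_codes)
print(decoded)
```

Returning None for out-of-range codes keeps a truncated list from raising an error while still making the gap visible.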
* ACKNOWLEDGMENTS *
We would like to thank: Cuckoo Sandbox for developing such an amazing dynamic analysis environment!
VirusShare! Because sharing is caring!
Universidade Nove de Julho for supporting this research.
Coordination for the Improvement of Higher Education Personnel (CAPES) for supporting this research.
* CITATIONS *
"Oliveira, Angelo; Sassi, Renato José (2019): Behavioral Malware Detection Using Deep Graph Convolutional Neural Networks. TechRxiv. Preprint." at https://doi.org/10.36227/techrxiv.10043099.v1 Please feel free to contact me for any further information.
Collecting and analysing heterogeneous data sources from the Internet of Things (IoT) and Industrial IoT (IIoT) are essential for training machine-learning-based cybersecurity applications and validating their fidelity. However, analysing those data sources remains a major challenge, in particular reducing the high-dimensional space and selecting important features and observations from different data sources.
The Boğaziçi University DDoS dataset (BOUN DDoS) was generated at Boğaziçi University with the Hping3 traffic generator by flooding TCP SYN and UDP packets. The dataset includes attack-free user traffic as well as attack traffic and is suitable for evaluating network-based DDoS detection methods. The attacks target a single victim server connected to the backbone router of the campus. Attack packets have randomly generated spoofed source IP addresses. The data trace was recorded on the backbone and includes over 4000 active hosts.
The dataset includes two different attack scenarios. In both scenarios, randomly generated spoofed IP addresses are used in
a flooding manner. For TCP flood attacks, TCP port 80 is used as the destination port. Each recording lasts 8 minutes.
In each recording, an 80-second waiting period is followed by a 20-second attack period. Different packet rates are used to let
researchers evaluate their detection methods at different packet rates.
The TCP SYN Flood and UDP flood datasets include attack rates of 1000, 1500, 2000 and 2500 packets/second. The
topology of the attack is given in Figure 1.
Fig. 1. BOUN DDoS attack topology.
Attack packets can be distinguished from attack-free packets using the destination IP address of packets. The victim IP
address is 10.50.199.86.
II. DATASET STRUCTURE
Datasets are in comma-separated value file format, and have the following columns:
Time: Time values start from zero and have a resolution of 0.000001 seconds. Time values are expressed in seconds.
Frame Number: Frame number is simply the incremental count of packets in the dataset.
Frame length: Frame length is the length of that packet in bytes.
Source ip: Source IP address of the packet.
Destination IP: Destination IP address of the packet.
Source Port: Source TCP port of the packet. If it is not a TCP packet, this field is empty.
Destination Port: Destination TCP port of the packet. If it is not a TCP packet, this field is empty.
SYN: This value is “Set” if the packet is a TCP packet and its SYN flag is equal to one; it is “Not Set” if the
packet is a TCP packet and its SYN flag is equal to zero. If the packet is not a TCP packet, this field is empty.
ACK: This value is “Set” if the packet is a TCP packet and its ACK flag is equal to one; it is “Not Set” if the
packet is a TCP packet and its ACK flag is equal to zero. If the packet is not a TCP packet, this field is empty.
RST: This value is “Set” if the packet is a TCP packet and its RST flag is equal to one; it is “Not Set” if the
packet is a TCP packet and its RST flag is equal to zero. If the packet is not a TCP packet, this field is empty.
TTL: Time-to-live value of the packet.
TCP Protocol: This value is TCP or UDP if the packet belongs to a transport-layer IP protocol. Otherwise, this field can
take other values.
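The labelling rule stated above (attack packets are those addressed to the victim) can be sketched as follows. This is only an illustration: the sample rows are invented, the exact header names in the real CSV files may differ from the column names listed here, and label_packets is a hypothetical helper.

```python
import csv
import io

VICTIM_IP = "10.50.199.86"  # victim address given in the dataset description

# Hypothetical three-row excerpt with a subset of the columns; values are made up.
SAMPLE = """Time,Frame Number,Frame length,Source ip,Destination IP
0.000001,1,60,192.0.2.10,10.50.199.86
0.000020,2,1514,10.50.1.7,10.50.3.9
0.000031,3,60,198.51.100.4,10.50.199.86
"""

def label_packets(csv_file, dest_column="Destination IP"):
    """Yield (row, is_attack) pairs; a packet is flagged when it targets the victim."""
    for row in csv.DictReader(csv_file):
        yield row, row[dest_column] == VICTIM_IP

labels = [is_attack for _, is_attack in label_packets(io.StringIO(SAMPLE))]
print(labels)
```

For a real file, pass an open file object instead of the in-memory sample and adjust dest_column to match the actual header.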
We created various types of network attacks in an Internet of Things (IoT) environment for academic purposes. Two typical smart home devices -- SKT NUGU (NU 100) and EZVIZ Wi-Fi Camera (C2C Mini O Plus 1080P) -- were used. All devices, including some laptops and smartphones, were connected to the same wireless network. The dataset consists of 42 raw network packet files (pcap) captured at different points in time.
* The packet files were captured using the monitor mode of a wireless network adapter. The wireless headers were removed with Aircrack-ng.
* All attacks except those in the Mirai Botnet category are packets captured while simulating attacks with tools such as Nmap. In the case of the Mirai Botnet category, the attack packets were generated on a laptop and then manipulated to make them appear to have originated from the IoT device.
<packet file description>
benign-dec.pcap: benign-only traffic
mitm-arpspoofing-n(1~6)-dec.pcap: traffic containing benign and MITM (ARP spoofing) attack
dos-synflooding-n(1~6)-dec.pcap: traffic containing benign and DoS (SYN flooding) attack
scan-hostport-n(1~6)-dec.pcap: traffic containing benign and Scan (host & port scan) attack
scan-portos-n(1~6)-dec.pcap: traffic containing benign and Scan (port & OS scan) attack
mirai-udpflooding-n(1~4)-dec.pcap: traffic containing benign and the 3 most typical attacks (UDP/ACK/HTTP flooding) of a zombie PC compromised by Mirai malware
mirai-hostbruteforce-n(1~5)-dec.pcap: traffic containing benign and the initial phase of Mirai malware, including host discovery and Telnet brute-force attack
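A file-level label can be derived from the name prefixes listed above. The sketch below is an assumption-laden illustration: category_for is a hypothetical helper, the category strings paraphrase the descriptions, and the file name passed in at the end is only an example of how the n(1~6) pattern might expand.

```python
# Prefixes and categories follow the packet file description above.
CATEGORY_BY_PREFIX = {
    "benign": "benign only",
    "mitm-arpspoofing": "MITM (ARP spoofing)",
    "dos-synflooding": "DoS (SYN flooding)",
    "scan-hostport": "Scan (host and port scan)",
    "scan-portos": "Scan (port and OS scan)",
    "mirai-udpflooding": "Mirai flooding (UDP/ACK/HTTP)",
    "mirai-hostbruteforce": "Mirai host discovery and Telnet brute force",
}

def category_for(filename):
    """Return the category whose prefix starts the file name, or None if unknown."""
    for prefix, category in CATEGORY_BY_PREFIX.items():
        if filename.startswith(prefix):
            return category
    return None

# Hypothetical file name; the real numbering scheme may differ.
print(category_for("dos-synflooding-1-dec.pcap"))
```

Note that these are per-file labels: every attack file also contains benign traffic, so packet-level labels need additional filtering.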
This dataset contains Cyber Threat Intelligence (CTI) data generated from public security reports and malware repositories.
The dataset is stored in a structured format (JSON) and includes approximately 640,000 records from 612 security reports published from January 2008 to June 2019.
Several data types are contained in this dataset such as URL, host, IP address, e-mail account, hashes (MD5, SHA1, and SHA256), common vulnerabilities and exposures (CVE), registry, file names ending with specific extensions, and the program database (PDB) path.
For more information about the dataset and the system that generated it, please see the following paper:
Daegeon Kim and Huy Kang Kim, “Automated Dataset Generation System for Collaborative Research of Cyber Threat Analysis,” Security and Communication Networks, vol. 2019, Article ID 6268476, 10 pages, 2019. https://doi.org/10.1155/2019/6268476.
This FFT-75 dataset contains randomly sampled, potentially overlapping file fragments from 75 popular file types (see details below). It is the most diverse and balanced dataset available to the best of our knowledge. The dataset is labeled with class IDs and is ready for training supervised machine learning models. We distinguish 6 different scenarios with different granularity and provide variants with 512 and 4096-byte blocks. In each case, we sampled a balanced dataset and split the data as follows: 80% for training, 10% for testing and 10% for validation.
See documentation (readme.md).
This dataset contains the library call lists obtained from programs implemented by using libiec61850. Call lists are marked either as benign, or according to the name of the attack.
Each file is a sequential list of library calls, separated by a newline. No special attention is required in processing the files.
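Since each file is a newline-separated list of calls, loading one takes only a few lines. The sketch below is a minimal illustration; read_call_list is a hypothetical helper and the call names in the sample are placeholders, not taken from the dataset.

```python
import io

def read_call_list(stream):
    """Return the library calls in file order, one per non-empty line."""
    return [line.strip() for line in stream if line.strip()]

# Hypothetical file content; replace with open(path) for a real call list.
sample = io.StringIO("call_a\ncall_b\ncall_a\n")
calls = read_call_list(sample)
print(calls)
```

The order of the calls is preserved, which matters if the lists are later treated as sequences rather than sets.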
This dataset details the state machine based experiments of PowerWatch.
PowerWatch Experiment Summaries
This dataset summarizes the experiments done for the PowerWatch paper. The accompanying code will be
shared after the paper is published.
There are 2 files:
* expres.csv: Each entry in this file represents the summary with respect to a unique state machine, represented by
the field "state_machine_id".
* runres.csv: For each state machine, a total of 45 runs were conducted; each individual run is represented by
an entry in this file.
The fields in "expres.csv" are explained as follows.
* state_machine_id: A number that uniquely identifies an experiment. The ID was also used as a random seed.
The naming here is, unfortunately, confusing.
* bucket_size: Chosen bucket size.
* window_size: Chosen window size.
The next 12 fields represent the "complexity" of the machine with respect to the call lists it emits.
In each experiment, two machines were run: benign and malicious. The difference between them is that, in the
malicious machine, there is one more state emitting a unique call list.
* cumulative_call_size_benign: Total number of calls emitted by benign states.
* mean_call_size_benign: Mean number of calls emitted per benign state.
* variance_call_size_benign: Variance of the number of calls emitted per benign state.
* malicious_state_call_size: Number of calls emitted by the malicious state.
* malicious_state_vocabulary_size: Number of different calls emitted by the malicious state.
* cumulative_edit_distance_every_state: Sum of the edit distances between every pair of states. Represents
how much the individual computing states vary from each other.
* mean_edit_distance_every_state: Mean of the edit distance computed between every state.
* variance_of_edit_distance_every_state: Variance of the edit distances computed between every state.
* cumulative_edit_distance_good_bad: Total edit distance computed between every benign state and the malicious state.
* mean_edit_distance_good_bad: Mean edit distance computed between every benign state and the malicious state.
* min_edit_distance_good_bad: Minimum of edit distances computed between every benign state and the malicious state.
* variance_edit_distance_good_bad: Variance of edit distances computed between every benign state and the malicious state.
* training_time: Total time required for training the machine learning model.
* prediction_time: Total time required for prediction stage.
* svm_accuracy: Accuracy of an SVM model that takes the maximum activity signal per run as input.
* svm_margin: Unused.
* mean_benign_train_activity_index: Mean activity index, calculated on the training set.
* mean_benign_test_activity_index: Mean activity index, calculated on the data obtained from the benign machine, but not
used for training.
* mean_malicious_activity_index: Mean activity index, calculated on the data obtained from the malicious machine.
The fields in "runres.csv" are explained as follows. Originally, a cascade of max-pooling and convolution mechanisms was
considered, but we later decided to use a single convolution step after the prediction stage. The field names follow the
initial algorithm and are a little misleading; they are explained below where necessary:
* state_machine_id: The ID of the associated experiment.
* run_number: The number of the run.
* malicious: Whether the run contained the malicious state.
* trained_on: Whether the resulting data was used in training.
Remember that the first convolution yields the activity signal. The individual points in the activity signal are
activity indices. Statistics about the activity signal are given in the following fields:
* min_of_first_convolution: Minimum value of the first convolution. This is the minimum activity index in the activity signal.
* max_of_first_convolution: Maximum value of the first convolution. This is the maximum activity index in the activity signal.
* mean_of_first_convolution: Mean value of the first convolution. This is the mean activity index in the activity signal.
* variance_of_first_convolution: Variance of the first convolution. This is the variance of the activity index in the activity signal.
* prediction_time: Time required to predict data generated in this run.
* reduction_time: Time required during the convolution stage.
* sp_accuracy: Accuracy of the predictor (predicting the next call).
* sp_misclassification: 1 - sp_accuracy.
* activity_index: This value was calculated with respect to the initial model and has no meaning in the final model. Disregard it.
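The two files share the state_machine_id field, so per-run rows can be grouped under their experiment. The following is a minimal sketch under stated assumptions: the excerpts are invented, they contain only a subset of the documented columns, the malicious flag is assumed to be encoded as 0/1, and mean_max_activity is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical excerpts with a subset of the documented columns; values are made up.
EXPRES = """state_machine_id,bucket_size,window_size
7,4,16
"""
RUNRES = """state_machine_id,run_number,malicious,max_of_first_convolution
7,0,0,0.10
7,1,0,0.20
7,2,1,0.90
"""

def mean_max_activity(expres_file, runres_file):
    """Return {state_machine_id: {malicious_flag: mean of max activity index}}."""
    known_ids = {row["state_machine_id"] for row in csv.DictReader(expres_file)}
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(runres_file):
        if row["state_machine_id"] in known_ids:  # keep runs with a matching experiment
            grouped[row["state_machine_id"]][row["malicious"]].append(
                float(row["max_of_first_convolution"]))
    return {sm: {flag: mean(vals) for flag, vals in flags.items()}
            for sm, flags in grouped.items()}

summary = mean_max_activity(io.StringIO(EXPRES), io.StringIO(RUNRES))
print(summary)
```

Comparing the benign and malicious means per state machine gives a quick view of how well the maximum activity index separates the two kinds of run.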
Modern technologies have made the capture and sharing of digital video commonplace; the combination of modern smartphones, cloud storage, and social media platforms has enabled video to become a primary source of information for many people and institutions. As a result, it is important to be able to verify the authenticity and source of this information, including identifying the source camera model that captured it. While a variety of forensic techniques have been developed for digital images, less research has been conducted on the forensic analysis of videos.
Website fingerprinting attacks, which use statistical analysis on network traffic to compromise user privacy, have been shown to be effective even if the traffic is sent over anonymity-preserving networks such as Tor. The classical attack model used to evaluate website fingerprinting attacks assumes an on-path adversary, who can observe all traffic traveling between the user's computer and the secure network.
Untar the data. Each directory corresponds to a different data collection setting:
linux_chrome: the data was collected on Linux using the Chrome browser.
linux_ff59: the data was collected on Linux using the Firefox browser.
linux_tor: the data was collected on Linux using the Tor browser.
win_chrome: the data was collected on Windows 10 using the Chrome browser.
win_ff59: the data was collected on Windows 10 using the Firefox browser.
mac_safari: the data was collected on a Mac machine using the Safari browser.
linux_tor_counter: the data was collected on Linux using the Tor browser while running countermeasures.
CW: closed-world setting; every website appears several times.
OW: open-world setting; every website appears only once (or very few times).
Use the following online Colab script to run the test set on the classifiers: