Pristine and Malicious URLs
- Submitted by: Ehsan Nowroozi
- Last updated: Mon, 11/06/2023 - 09:43
- DOI: 10.21227/2ph5-xc09
Abstract
The goal of our research is to identify malicious advertisement URLs and to apply adversarial attacks on ensembles. We extract lexical and web-scraped features from the URLs using Python code. Four machine learning algorithms are then applied for classification, and K-Means clustering is used for visual understanding. We check the vulnerability of the models to adversarial examples by applying the Zeroth Order Optimization (ZOO) adversarial attack and computing the attack accuracy.
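The idea behind a zeroth-order attack can be illustrated with a minimal sketch: the attacker only queries a black-box score and estimates its gradient by symmetric finite differences. This is not the authors' script and not the full ZOO algorithm (which uses stochastic coordinate updates); the names `estimate_gradient`, `zeroth_order_step`, and the toy `score` function are hypothetical and chosen only for illustration.

```python
def estimate_gradient(score, x, h=1e-4):
    """Estimate the gradient of a black-box score at x by symmetric differences."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        # Only function values are used; no access to model internals is assumed.
        grad.append((score(plus) - score(minus)) / (2 * h))
    return grad


def zeroth_order_step(score, x, step=0.1, h=1e-4):
    """Take one descent step on the score using the estimated gradient."""
    grad = estimate_gradient(score, x, h)
    return [xi - step * gi for xi, gi in zip(x, grad)]


# Toy stand-in for a classifier's confidence (an assumption, not a real model).
def score(v):
    return v[0] ** 2 + 3 * v[1]


perturbed = zeroth_order_step(score, [1.0, 2.0])
```

In an attack setting, `score` would be replaced by a quantity such as the model's confidence in the true class, and the step would be repeated until the prediction changes or a perturbation budget is exhausted.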
The datasets are taken from different sources available on the internet. We considered 12 datasets, 6 of malicious URLs and 6 of benign URLs, which together include about 3980870 URLs. We extracted 89 lexical and web-scraped features for the subsequent tasks.
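As a rough illustration of what lexical features of a URL can look like, the sketch below computes a handful of them with the Python standard library. The feature names and the function `lexical_features` are hypothetical; they are not the 89 features used in this dataset, whose exact definitions are in the accompanying scripts.

```python
from urllib.parse import urlparse, parse_qsl


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string."""
    # Add a scheme when missing so that urlparse can locate the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_query_params": len(parse_qsl(parsed.query)),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


features = lexical_features("https://example.com/a-b?x=1&y=2")
```

Each URL then becomes one row of numeric or boolean values that a classifier can consume; web-scraped features would require fetching the page and are not shown here.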
The experimental setup for advertising URLs draws on 12 distinct datasets comprising 3980870 URLs. These datasets contain two kinds of URLs: benign and malicious. The malicious URLs fall into four distinct sub-categories: spam, defacement, malware, and phishing. We also examined all of the URLs using the VirusTotal tool to confirm their authenticity.
Dataset Files
- URL Datasets: URL Datasets.zip (4.09 MB)
- Scripts: Scripts.zip (38.55 kB)
Comments
Thank you for your datasets.
I cannot download the dataset even after registering.
qi liu