Malicious and Benign Websites

Citation Author(s):
Christian Urcuqui, Andrés Navarro, José Osorio, Melisa García
Submitted by:
Christian Urcuqui
Last updated:
Thu, 11/08/2018 - 10:34
DOI:
10.21227/H26Q1T

Abstract 

One important task is to build a good set of malicious web characteristics, because it is difficult to find one that is both up to date and backed by published research.

 

This dataset is another research product of my bachelor students. It is the result of a project that evaluated classification models to predict malicious and benign websites from their application-layer and network characteristics. The data were obtained through a process that combined several sources of benign and malicious URLs; all of them were verified and then visited with a low-interaction client honeypot to capture their network traffic. In addition, we used other tools to gather further information, such as the server country via Whois.
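
As an illustration of that Whois step, here is a minimal sketch of how the registration metadata could be collected for a domain. It is not our original script: it assumes the third-party python-whois package, and fields such as country or state can be missing for some registrars.

    # Minimal sketch of the Whois lookup step, assuming the third-party
    # "python-whois" package (pip install python-whois); this is not the
    # project's original script, and field availability varies by registrar.
    import whois

    def whois_features(domain):
        record = whois.whois(domain)  # query the Whois service for the domain
        return {
            "WHOIS_COUNTRY": record.country,            # e.g. "US"
            "WHOIS_STATEPRO": record.state,             # state/province
            "WHOIS_REGDATE": record.creation_date,      # registration date
            "WHOIS_UPDATED_DATE": record.updated_date,  # last update date
        }

    print(whois_features("example.com"))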

 

This is the first version, but we already have results from applying machine learning classifiers in a bachelor thesis and in an article, so the full data-processing steps and the data description can be found in those works. I may also post a summary of them on this page in the coming days.

 

If your papers or other works use our dataset, please cite our paper as follows: Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings, Vol. 1950, 14-17.

 

If you need a state-of-the-art article on website cybersecurity, it is available in English and Spanish: Urcuqui, C., Peña, M. G., Quintero, J. L. O., & Cadavid, A. N. (2017). Antidefacement. Sistemas & Telemática, 14(39), 9-27.
 

If you have any questions or feedback, please do not hesitate to write to the following email:

ccurcuqui@icesi.edu.co

Instructions: 

Malicious websites are of great concern because analyzing URLs one by one and indexing each of them in a blacklist does not scale. Unfortunately, there is a lack of datasets with both malicious and benign web characteristics. This dataset is a research product of my bachelor students that aims to fill this gap.


--------------------------------------------------------------

Data Description

 

  • URL: the anonymized identifier of the URL analyzed in the study
  • URL_LENGTH: the number of characters in the URL
  • NUMBER_SPECIAL_CHARACTERS: the number of special characters in the URL, such as “/”, “%”, “#”, “&”, “.”, “=”
  • CHARSET: a categorical value naming the character encoding standard (also called character set)
  • SERVER: a categorical value naming the operating system of the server, obtained from the packet response
  • CONTENT_LENGTH: the content size reported in the HTTP header
  • WHOIS_COUNTRY: a categorical variable whose values are the countries obtained from the server response (specifically, our script used the Whois API)
  • WHOIS_STATEPRO: a categorical variable whose values are the states obtained from the server response (specifically, our script used the Whois API)
  • WHOIS_REGDATE: the server registration date provided by Whois, a date value with format DD/MM/YYYY HH:MM
  • WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through Whois
  • TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client
  • DIST_REMOTE_TCP_PORT: the number of distinct TCP ports detected on the remote server
  • REMOTE_IPS: the total number of IPs connected to the honeypot
  • APP_BYTES: the number of bytes transferred
  • SOURCE_APP_PACKETS: packets sent from the honeypot to the server
  • REMOTE_APP_PACKETS: packets received from the server
  • APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server
  • DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server
  • TYPE: a categorical variable whose values represent the type of web page analyzed: 1 for malicious websites and 0 for benign websites (a baseline usage sketch follows this list)
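
As a quick-start example, the sketch below loads the dataset and trains a baseline classifier on the numeric features, with TYPE as the label. It is a minimal sketch under assumptions: the file name dataset.csv is hypothetical (replace it with the name of the downloaded file), it requires the pandas and scikit-learn packages, and the categorical columns (CHARSET, SERVER, the WHOIS fields) would need encoding before they could be added.

    # Baseline sketch: predict TYPE (1 = malicious, 0 = benign) from the
    # numeric features described above. "dataset.csv" is a hypothetical
    # file name; requires pandas and scikit-learn.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dataset.csv")

    numeric_features = [
        "URL_LENGTH", "NUMBER_SPECIAL_CHARACTERS", "CONTENT_LENGTH",
        "TCP_CONVERSATION_EXCHANGE", "DIST_REMOTE_TCP_PORT", "REMOTE_IPS",
        "APP_BYTES", "SOURCE_APP_PACKETS", "REMOTE_APP_PACKETS",
        "APP_PACKETS", "DNS_QUERY_TIMES",
    ]
    X = df[numeric_features].fillna(0)  # CONTENT_LENGTH can be missing
    y = df["TYPE"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))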