Malicious and Benign Websites

Citation Author(s):
Christian Urcuqui, Andrés Navarro, José Osorio, Melisa García
Submitted by:
Christian Urcuqui
Last updated:
Thu, 11/08/2018 - 10:34
DOI:
10.21227/H26Q1T

Abstract 

One important task is to build a good set of malicious web characteristics, because it is difficult to find one that is both up to date and backed by published research.

 

This dataset is another research product of my bachelor students. It is the result of a project that evaluated classification models to predict malicious and benign websites from their application-layer and network characteristics. The data were obtained through a process that combined several sources of benign and malicious URLs; all of them were verified and then visited with a low-interaction client honeypot to capture their network traffic. In addition, we used other tools to gather further information, such as the server country via Whois.
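
As an illustration of that Whois step, here is a minimal sketch of how the registration metadata could be collected for a domain. It is not our original script: it assumes the third-party python-whois package, and fields such as country or state can be missing for some registrars.

    # Minimal sketch of the Whois lookup step, assuming the third-party
    # "python-whois" package (pip install python-whois); this is not the
    # project's original script, and field availability varies by registrar.
    import whois

    def whois_features(domain):
        record = whois.whois(domain)  # query the Whois service for the domain
        return {
            "WHOIS_COUNTRY": record.country,            # e.g. "US"
            "WHOIS_STATEPRO": record.state,             # state/province
            "WHOIS_REGDATE": record.creation_date,      # registration date
            "WHOIS_UPDATED_DATE": record.updated_date,  # last update date
        }

    print(whois_features("example.com"))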

 

This is the first version, but we already have results from applying machine learning classifiers in a bachelor thesis and in an article, so the full data-processing steps and the data description can be found in those works. I may also post a summary of them on this page in the coming days.

 

If your papers or other works use our dataset, please cite our paper as follows: Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings, Vol. 1950, 14-17.

 

If you need a state-of-the-art article on website cybersecurity, it is available in English and Spanish: Urcuqui, C., Peña, M. G., Quintero, J. L. O., & Cadavid, A. N. (2017). Antidefacement. Sistemas & Telemática, 14(39), 9-27.
 

If you have any questions or feedback, please do not hesitate to write to the following email:

ccurcuqui@icesi.edu.co

Instructions: 

Malicious websites are of great concern because analyzing URLs one by one and indexing each of them in a blacklist does not scale. Unfortunately, there is a lack of datasets with both malicious and benign web characteristics. This dataset is a research product of my bachelor students that aims to fill this gap.


--------------------------------------------------------------

Data Description

 

  • URL: the anonymized identifier of the URL analyzed in the study
  • URL_LENGTH: the number of characters in the URL
  • NUMBER_SPECIAL_CHARACTERS: the number of special characters in the URL, such as “/”, “%”, “#”, “&”, “.”, “=”
  • CHARSET: a categorical value naming the character encoding standard (also called character set)
  • SERVER: a categorical value naming the operating system of the server, obtained from the packet response
  • CONTENT_LENGTH: the content size reported in the HTTP header
  • WHOIS_COUNTRY: a categorical variable whose values are the countries obtained from the server response (specifically, our script used the Whois API)
  • WHOIS_STATEPRO: a categorical variable whose values are the states obtained from the server response (specifically, our script used the Whois API)
  • WHOIS_REGDATE: the server registration date provided by Whois, a date value with format DD/MM/YYYY HH:MM
  • WHOIS_UPDATED_DATE: the last update date of the analyzed server, obtained through Whois
  • TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the server and our honeypot client
  • DIST_REMOTE_TCP_PORT: the number of distinct TCP ports detected on the remote server
  • REMOTE_IPS: the total number of IPs connected to the honeypot
  • APP_BYTES: the number of bytes transferred
  • SOURCE_APP_PACKETS: packets sent from the honeypot to the server
  • REMOTE_APP_PACKETS: packets received from the server
  • APP_PACKETS: the total number of IP packets generated during the communication between the honeypot and the server
  • DNS_QUERY_TIMES: the number of DNS packets generated during the communication between the honeypot and the server
  • TYPE: a categorical variable whose values represent the type of web page analyzed: 1 for malicious websites and 0 for benign websites (a baseline usage sketch follows this list)
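
As a quick-start example, the sketch below loads the dataset and trains a baseline classifier on the numeric features, with TYPE as the label. It is a minimal sketch under assumptions: the file name dataset.csv is hypothetical (replace it with the name of the downloaded file), it requires the pandas and scikit-learn packages, and the categorical columns (CHARSET, SERVER, the WHOIS fields) would need encoding before they could be added.

    # Baseline sketch: predict TYPE (1 = malicious, 0 = benign) from the
    # numeric features described above. "dataset.csv" is a hypothetical
    # file name; requires pandas and scikit-learn.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("dataset.csv")

    numeric_features = [
        "URL_LENGTH", "NUMBER_SPECIAL_CHARACTERS", "CONTENT_LENGTH",
        "TCP_CONVERSATION_EXCHANGE", "DIST_REMOTE_TCP_PORT", "REMOTE_IPS",
        "APP_BYTES", "SOURCE_APP_PACKETS", "REMOTE_APP_PACKETS",
        "APP_PACKETS", "DNS_QUERY_TIMES",
    ]
    X = df[numeric_features].fillna(0)  # CONTENT_LENGTH can be missing
    y = df["TYPE"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))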