Abstract

This repository contains the results of 30 public Internet browsing experiments run from a computer on the campus network of the Public University of Navarre: 20 used plaintext HTTP browsing and 10 used HTTPS. We provide both the original data sources, in the form of network packet traces and HAR waterfalls, and the processed results, formatted as line-based text files.

Instructions:

Each experiment consisted of a Selenium-automated web browser (Google Chrome 80.0) visiting a set of predefined web sites, with all caching options disabled. Both network packet traffic traces and in-browser measurements were collected. The network measurements were collected using tcpdump running at the client, while in-browser measurements were collected through the HAR Export Trigger extension. We have uploaded both sets of files.

The sets of websites for the HTTP and HTTPS experiments are different, because modern websites usually support HTTPS but not plaintext HTTP. The HTTPS set was obtained by collecting the top 2000 websites from the Alexa Top Ranking. The HTTP set is the subset of those 2000 websites that supported plaintext HTTP. To increase the number of plain HTTP measurements, each of these websites was crawled by following its embedded ‘http://’ links.

For each web resource requested by the browser, we computed the time elapsed between the HTTP request being sent and the response being fully received; this is referred to as the resource's response time. Each response time obtained, along with the URL for that resource, and the timestamp at which the request was made, is referred to as a sample. These samples are obtained from the browser measurements and from network traffic. For the HTTPS experiments, the network data was decrypted using the ephemeral per-session encryption keys generated by the web browser. The files containing these keys have also been uploaded.
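As an illustration of how a sample (timestamp, response time, URL) could be derived from one of the HAR files, the sketch below reads the standard HAR 1.2 fields `log.entries`, `startedDateTime`, `request.url` and `timings`. Approximating the response time as the sum of the `wait` and `receive` phases is an assumption made for this example, not necessarily the computation used to produce the dataset, and the function name `har_samples` is hypothetical. Note also that `startedDateTime` marks the start of the entry, which may precede the instant the request was actually sent.

```python
from datetime import datetime


def har_samples(har):
    """Yield (timestamp_s, response_time_s, url) for each entry of a parsed HAR dict."""
    for entry in har["log"]["entries"]:
        timings = entry["timings"]
        # HAR uses -1 for phases that do not apply; count those as zero.
        wait_ms = max(timings.get("wait", 0), 0)
        receive_ms = max(timings.get("receive", 0), 0)
        # startedDateTime is ISO 8601; normalise a trailing "Z" for fromisoformat.
        started = datetime.fromisoformat(
            entry["startedDateTime"].replace("Z", "+00:00")
        )
        # Assumption: response time ~ time waiting for + receiving the response.
        response_time_s = (wait_ms + receive_ms) / 1000.0
        yield started.timestamp(), response_time_s, entry["request"]["url"]
```

A HAR file can be loaded with `json.load` and passed directly to this function.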

A number of resources, such as cascading style sheets or images, are requested more than once during each test. Although we deactivated the cache, the browser still sometimes reported repeated resources with a false response time of zero, because the request was never issued to the server and the resource was served from a cache instead. In addition, a small number of requests trigger an exception in the browser, which prevents data from being collected at the client side, although the request and response are present in the network traffic. These behaviours complicate one-to-one comparisons between network and in-browser measurements, because the network traffic and the browser report may contain a different number of response times for the same resource. For this reason, we exported to the text files only the first response time seen for each resource with a unique URL. This filtering removes the false measurements reported by the browser. If this filtering is not desired, all the data can be obtained from the uploaded pcap and HAR files.
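The first-occurrence filtering described above can be sketched as follows. This is a minimal illustration, assuming the samples are already ordered by request time and represented as (timestamp, response time, URL) tuples; the function name `first_sample_per_url` is hypothetical and this is not the authors' processing code.

```python
def first_sample_per_url(samples):
    """Keep only the first sample seen for each URL, preserving input order."""
    seen = set()
    kept = []
    for timestamp, response_time, url in samples:
        if url in seen:
            # A later request for the same URL (e.g. a cached repeat) is dropped.
            continue
        seen.add(url)
        kept.append((timestamp, response_time, url))
    return kept
```

Applying the same function to samples rebuilt from the raw pcap and HAR files should give one entry per URL on both sides, which is what makes a one-to-one comparison possible.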

The dataset contains the original PCAP and HAR files, and also the post-processed files obtained from them. The raw data is contained in the raw_http.zip and raw_https.zip files, while the post-processed files are contained in the data.zip file. Inside the data.zip archive there are two directories, corresponding to the HTTP and HTTPS experiments respectively.

Both raw data archives contain files named X.pcap and subdirectories named X_har (with X being the name of each individual experiment), corresponding to the data gathered from network traces and in-browser measurements respectively. Inside each X_har directory, a .har file is stored for each visited site with the full download waterfall. Additionally, decryption keys for the HTTPS experiments are provided, under the name of X.key.

The data.zip archive contains three files for each experiment, amounting to a total of 60 and 30 files for HTTP and HTTPS respectively.

The three files describing each experiment contain line-based text data, and are named X_network_tresp.txt, X_browser_tresp.txt and X_conn_info.txt, with X being the name of each individual experiment. The first two files contain, on each line, space-separated fields describing a single request-response sample. X_network_tresp.txt contains the information gathered from network traces, while X_browser_tresp.txt was obtained from browser instrumentation. X_conn_info.txt, in turn, contains on each line space-separated fields describing one TCP connection observed during the experiment, obtained from the network traces.

The connections in X_conn_info.txt and the samples in X_network_tresp.txt are linked through a unique connection ID field present on each line of both files. Note that this is a one-to-many relationship: a connection ID is associated with a single TCP stream (i.e. one line in X_conn_info.txt), but with one or more samples (i.e. lines in X_network_tresp.txt).
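The one-to-many relationship can be illustrated with a small join on the connection ID. The sketch below assumes each line has already been parsed into a dictionary with a `conn_id` key; that key name and the function name `samples_by_connection` are hypothetical choices for this example.

```python
from collections import defaultdict


def samples_by_connection(connections, samples):
    """Map each connection ID to (connection record, list of its samples)."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["conn_id"]].append(sample)
    # One connection record per ID, with every sample carried by that connection.
    return {
        conn["conn_id"]: (conn, grouped.get(conn["conn_id"], []))
        for conn in connections
    }
```

With this structure, per-connection metrics such as the RTT can be compared against the response times of all requests carried by that connection.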

We describe below the line format for each file. This information is also included in the "format.txt" file, located in the top-level directory of the compressed archive.

X_conn_info.txt:

Connection ID
RTT (milliseconds)
Number of retransmissions
Number of sequence holes
Number of data packets, client to server
Number of data packets, server to client

X_network_tresp.txt:

Request timestamp (seconds)
Response time (seconds)
Requested URL
Response size (bytes)
Connection ID

X_browser_tresp.txt:

Request timestamp (seconds)
Response time (seconds)
Requested URL
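A minimal parsing sketch for these three line formats follows. It assumes that the fields appear on each line in the order listed above, that URLs contain no unescaped spaces, that the counters and sizes are written as integers, and that the connection ID can be treated as an opaque string. The dictionary key names and the `parse_line` helper are hypothetical, introduced only for this example; "format.txt" remains the authoritative description.

```python
# (name, type) pairs for each file, in the assumed on-disk field order.
CONN_FIELDS = [
    ("conn_id", str), ("rtt_ms", float), ("retransmissions", int),
    ("sequence_holes", int), ("data_pkts_c2s", int), ("data_pkts_s2c", int),
]
NETWORK_FIELDS = [
    ("timestamp_s", float), ("response_time_s", float), ("url", str),
    ("response_bytes", int), ("conn_id", str),
]
BROWSER_FIELDS = [
    ("timestamp_s", float), ("response_time_s", float), ("url", str),
]


def parse_line(line, fields):
    """Split one space-separated line and convert each field to its type."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(fields, values)}
```

For example, a whole file could be read with `[parse_line(l, NETWORK_FIELDS) for l in open(path) if l.strip()]`, where `path` points to an X_network_tresp.txt file.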

Dataset Files

Processed data data.zip (343.36 MB)
Raw data (HTTPS) raw_https.zip (58.00 GB)
Raw data (HTTP part 1) raw_http.z00.zip (9.63 GB)
Raw data (HTTP part 2) raw_http.z01.zip (50.00 GB)
Raw data (HTTP part 3) raw_http.z02.zip (50.00 GB)
Raw data (HTTP part 4) raw_http.z03.zip (50.00 GB)

Open Access

In-browser and network traffic based web response time measurements
