A Dataset of Network Traffic Collected During Large-Scale Human Genome Sequence Analysis

Citation Author(s):
Manas Das, University of Missouri
Khawar Shehzad, University of Missouri
Praveen Rao, University of Missouri
Submitted by:
Praveen Rao
Last updated:
Tue, 05/30/2023 - 17:04
DOI:
10.21227/y0t5-1w13
Data Format:
License:

Abstract 

This dataset contains .pcap files collected during the execution of variant calling on a large number of human genomes using a cluster. The GATK4 variant calling pipeline was executed using AVAH on two testbeds, CloudLab and FABRIC. A 16-node cluster was used on CloudLab, and an 8-node cluster was used on FABRIC. The files were collected by running tcpdump on the network interfaces of the nodes. One file was produced every 30 minutes, and a snapshot length of 94 bytes was specified for tcpdump, so at most the first 94 bytes of each packet are stored. On CloudLab, bare-metal servers were used for the cluster; on FABRIC, virtual machines were used. Each .pcap file is named with the worker/host name and the start time of the capture for that file.
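
Because of the 94-byte snapshot length, the bytes stored for a packet can be fewer than the bytes that were on the wire. The following is a minimal Python sketch, assuming the files use the classic libpcap format (not pcapng), that reads only the per-packet record headers and reports both totals. The function name summarize_pcap is hypothetical and is not part of the dataset or of any tool named above.

```python
import struct

def summarize_pcap(path):
    """Return (packet_count, captured_bytes, original_bytes) for a classic .pcap file.

    Assumption: the file is in the classic libpcap format, not pcapng.
    """
    with open(path, "rb") as f:
        header = f.read(24)  # the pcap global header is 24 bytes
        if len(header) < 24:
            raise ValueError("file too short for a pcap global header")
        magic = header[:4]
        # The magic number reveals the byte order (microsecond or nanosecond variant).
        if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
            endian = "<"
        elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
            endian = ">"
        else:
            raise ValueError("not a classic pcap file")

        packets = captured = original = 0
        while True:
            record = f.read(16)  # each record header is 16 bytes
            if len(record) < 16:
                break
            # Fields: timestamp seconds, timestamp fraction, stored length, on-wire length.
            _ts_sec, _ts_frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
            f.seek(incl_len, 1)  # skip the stored (possibly truncated) packet bytes
            packets += 1
            captured += incl_len
            original += orig_len
    return packets, captured, original
```

For traffic-volume estimates, the third value (on-wire bytes) is usually the one of interest, since the second value is capped by the snapshot length.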

Instructions: 

1. Download the .pcap.tar.gz files. The name of the testbed is provided as a prefix for each tarball. Each tarball corresponds to the traffic sent/received by one worker node in the cluster.

2. Unzip/untar the files using tar. For example, use: tar xvfz <testbed>-vm1.tar.gz  

3. Use Wireshark (https://www.wireshark.org/) or tshark (https://www.wireshark.org/docs/man-pages/tshark.html) to analyze the traffic data.
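
For scripted workflows, the extraction step can also be done from Python. This is a minimal sketch using the standard tarfile module; the function name extract_captures and the destination directory are hypothetical, and it assumes the extracted capture files end in .pcap as described in the abstract.

```python
import tarfile
from pathlib import Path

def extract_captures(tarball, dest):
    """Extract a gzip-compressed tarball into dest and return the .pcap files found."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        # Only extract archives from a source you trust.
        tar.extractall(dest)
    # Search recursively in case the tarball contains subdirectories.
    return sorted(dest.rglob("*.pcap"))
```

The returned paths can then be opened in Wireshark or passed to tshark one at a time.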

Funding Agency: 
National Science Foundation
Grant Number: 
2201583, 2034247