MS-BioGraphs: Trillion-Scale Sequence Similarity Graph Datasets

Citation Author(s):
Mohsen Koohi Esfahani, Queen's University Belfast, University of Sistan & Baloochestan
Sebastiano Vigna, Università degli Studi di Milano
Paolo Boldi, Università degli Studi di Milano
Hans Vandierendonck, Queen's University Belfast
Peter Kilpatrick, Queen's University Belfast
Submitted by:
Mohsen Koohi Es...
Last updated:
Sun, 05/05/2024 - 02:46
DOI:
10.21227/gmd9-1534

Abstract 

MS-BioGraphs are a family of sequence similarity graph datasets with up to 2.5 trillion edges. The graphs have weighted edges and are presented in compressed WebGraph format. The datasets include symmetric and asymmetric graphs. The largest graph was created by matching sequences in the Metaclust dataset, which has 1.7 billion sequences. These real-world graph datasets are useful for measuring contributions in High-Performance Computing and High-Performance Graph Processing. Moreover, they provide a representation of the data that acts as a new source for extracting domain-specific information and knowledge by deploying graph algorithms. Sequence similarity graphs have several uses in biology, including sequence clustering, predicting pseudo-gene functions, effective selection of conotoxins, and predicting evolution and gene transfer.

Instructions: 

What are these files?

The files in this dataset contain 8 sequence similarity graphs created in the MS-BioGraphs project (https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs/).
The graph names and the total size of each (in terabytes) are:

(1) MS (11 TB)
(2) MSA500 (5 TB)
(3) MS200 (2.5 TB)
(4) MSA200 (2.5 TB)
(5) MS50 (0.7 TB)
(6) MSA50 (0.7 TB)
(7) MSA10 (0.2 TB)
(8) MS1 (0.02 TB)

Each graph has a set of files whose names start with the name of the graph. Assuming the graph is named XXX, the files are:

(a) XXX-underlying.graph
(b) XXX-underlying.properties
(c) XXX-underlying.offsets
These three files contain the underlying graph (edges without weights) in WebGraph format. If your application does not require edge weights, you can use these files directly.
Note: For the MS graph, the underlying.graph file is around 7 TB and has been stored in two parts: `MS-underlying.graph.aa` and `MS-underlying.graph.ab`. As WebGraph needs a single `.graph` file, these two parts must be merged using `cat` (e.g., `cat MS-underlying.graph.?? > MS-underlying.graph`) or `dd` before passing the graph to WebGraph or ParaGrapher.

(d) XXX-weights.labels
(e) XXX-weights.properties
(f) XXX-weights.labeloffsets
These three files contain the weights of the edges. When traversing the weighted (arc-labelled) graph, you need to pass the XXX-weights basename to WebGraph.

(g) XXX_edges_shas.txt

This file contains the shasums of edge blocks. Each block contains 64 million consecutive edges and has two shasums: one for its 64M endpoints and one for its 64M edge weights.
The file is used to validate the underlying graph and the weights. For further explanation of the validation process, please visit https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation.

(h) XXX_offsets.bin
The offsets array of the CSX (Compressed Sparse Rows/Columns) graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively.
This array helps convert the graph (or parts of it) from WebGraph format to binary format in one pass over the (related) edges.
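The layout described above can be read without loading the whole file. The following is a minimal sketch using only the Python standard library. It assumes the 8-byte elements are unsigned integers (the description states only the size and byte order), and the function names are hypothetical helpers, not part of WebGraph or ParaGrapher.

```python
import os
import struct

ELEMENT_SIZE = 8  # each offsets element is 8 bytes, little endian


def vertex_count(path):
    """Return |V|, given that the file holds |V|+1 elements."""
    return os.path.getsize(path) // ELEMENT_SIZE - 1


def read_offset_pair(path, v):
    """Return (offsets[v], offsets[v+1]) by seeking to vertex v."""
    with open(path, "rb") as f:
        f.seek(ELEMENT_SIZE * v)
        data = f.read(2 * ELEMENT_SIZE)
    if len(data) != 2 * ELEMENT_SIZE:
        raise IndexError(f"vertex {v} is out of range")
    # "<2Q": two little-endian 64-bit integers (unsigned is an assumption)
    return struct.unpack("<2Q", data)


def degree(path, v):
    """Number of edges of vertex v, i.e. offsets[v+1] - offsets[v]."""
    start, end = read_offset_pair(path, v)
    return end - start


def edge_count(path):
    """Return |E|, which is stored as the last element of the array."""
    with open(path, "rb") as f:
        f.seek(-ELEMENT_SIZE, os.SEEK_END)
        return struct.unpack("<Q", f.read(ELEMENT_SIZE))[0]
```

For a vertex `v`, the pair returned by `read_offset_pair` gives the index range of its edges in the edge array, which is what makes a one-pass conversion of a subgraph possible.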

(i) XXX-wcc.bin
The Weakly-Connected Component (WCC) array in binary format and little-endian order. This array consists of |V| 4-byte elements. Vertices in the same component have the same value in the WCC array.
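As an illustration of how the WCC array can be used, the sketch below counts the size of each component by streaming the file in chunks. It assumes the 4-byte elements are unsigned integers, and `component_sizes` is a hypothetical helper name. For the largest graphs the number of distinct components may be too large for an in-memory counter, so treat this as a starting point only.

```python
import struct
from collections import Counter


def component_sizes(path, chunk_vertices=1 << 20):
    """Map each component label to the number of vertices carrying it."""
    sizes = Counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(4 * chunk_vertices)
            if not data:
                break
            # "<I": little-endian 32-bit integer (unsigned is an assumption)
            sizes.update(label for (label,) in struct.iter_unpack("<I", data))
    return sizes
```

Calling `component_sizes(path).most_common(1)` then gives the label and size of the largest component, which can be compared with the figure published on the graph's web page.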

(j) XXX.ojson
The characteristics of the graph and the shasums of its files. It is in the open JSON format and needs a closing brace (}) to be appended before being passed to a JSON parser.
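A minimal sketch of the step described above: read the file, append the closing brace, and parse it. `load_ojson` is a hypothetical helper name, and the sketch assumes the content is valid JSON once the brace is appended.

```python
import json


def load_ojson(path):
    """Parse an open-JSON (.ojson) file by appending the missing '}'."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Strip trailing whitespace so the brace closes the top-level object.
    return json.loads(text.rstrip() + "}")
```

The result is an ordinary dictionary, so the recorded characteristics and shasums can be looked up by key.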

(k) XXX_trans_offsets.bin
The offsets array of the transposed graph in binary format and little-endian order. It consists of |V|+1 8-byte elements. The first and last values are 0 and |E|, respectively.
It helps transpose the graph in one pass over the edges.

(l) XXX-n2o.bin
The New to Old (N2O) reordering array of the graph in binary format and little-endian order. It consists of |V| 4-byte elements and gives the old ID of each vertex, which is used to look up the name of the vertex (protein) in the `names.tar.gz` file.

(m) names.tar.gz
This compressed file contains 120 files in CSV format using ‘;’ as the separator. Each row has two columns: ID of vertex and name of the sequence.
Note: If the graph has a `XXX_n2o.bin` file, use the n2o array to convert each vertex ID to its old vertex ID, which identifies the name of the protein in the `names.tar.gz` file.
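The lookup described above can be sketched in two steps: map the vertex ID to its old ID through the n2o array, then search the CSV files for that ID. The sketch assumes that the 4-byte n2o elements are unsigned integers, that the CSV files have already been extracted from `names.tar.gz`, and that they are UTF-8 text. `old_id` and `find_name` are hypothetical helper names.

```python
import csv
import struct


def old_id(n2o_path, new_id):
    """Return the old vertex ID stored at position new_id of the n2o array."""
    with open(n2o_path, "rb") as f:
        f.seek(4 * new_id)
        data = f.read(4)
    if len(data) != 4:
        raise IndexError(f"vertex {new_id} is out of range")
    # "<I": little-endian 32-bit integer (unsigned is an assumption)
    return struct.unpack("<I", data)[0]


def find_name(csv_paths, vertex_id):
    """Scan ';'-separated (ID, name) rows and return the matching name."""
    wanted = str(vertex_id)
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter=";"):
                if len(row) >= 2 and row[0] == wanted:
                    return row[1]
    return None  # ID not found in any of the given files
```

A linear scan is enough for a handful of lookups. For many lookups, building an index over the files once would be the more sensible design.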

 

How to read these graphs?

- The graphs are presented in WebGraph format (https://webgraph.di.unimi.it/) as arc-labelled graphs.
- For sample code and validation code, please refer to https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation.
- ParaGrapher (https://blogs.qub.ac.uk/DIPSA/ParaGrapher/) is a graph loading library that reads, decompresses, and passes the graph or its requested subgraph to your graph processing framework.

 

More Information About The Graphs?

More information on each graph is provided on the following web pages, including:
(1) |V|, |E|, max/min degree and weight, number of zero degree vertices, number of weakly connected components and size of the largest component
(2) Size of each file of the graph and its shasum
(3) In- and Out-Degree Distribution plots
(4) Weight and Vertex-Relative Weight Distribution plots
(5) Degree Decomposition plot
(6) Push and Pull Locality plots
(7) Cell-Binned Average Weight Degree Distribution plot
(8) Weakly-Connected Components Size Distribution plot

 

MS-BioGraphs MS
MS-BioGraphs MSA500
MS-BioGraphs MS200
MS-BioGraphs MSA200
MS-BioGraphs MS50
MS-BioGraphs MSA50
MS-BioGraphs MSA10
MS-BioGraphs MS1

 

Any Problem?

Please contact the first author.

 

Dataset Files