Web Data Commons - Hyperlink Graphs

The graphs have been extracted from the 2012 and 2014 versions of the Common Crawl web corpora. The 2012 graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, it is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The 2014 graph covers 1.7 billion web pages connected by 64 billion hyperlinks. Below we provide instructions on how to download the graphs as well as basic statistics about their topology.
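As a rough illustration of the scale figures above, the following sketch computes the average number of hyperlinks per page for each graph. It uses only the rounded page and link counts quoted in this description, so the results are approximations; the class and method names are hypothetical and not part of any Web Data Commons tooling.

```java
class LinkDensity {
    // Average number of hyperlinks per page, from total link and page counts.
    static double averageLinksPerPage(long links, long pages) {
        return (double) links / pages;
    }

    public static void main(String[] args) {
        // Rounded counts taken from the dataset description above.
        long pages2012 = 3_500_000_000L;
        long links2012 = 128_000_000_000L;
        long pages2014 = 1_700_000_000L;
        long links2014 = 64_000_000_000L;

        System.out.println("2012 graph, links per page: "
                + averageLinksPerPage(links2012, pages2012));
        System.out.println("2014 graph, links per page: "
                + averageLinksPerPage(links2014, pages2014));
    }
}
```

With these rounded inputs, both graphs come out at roughly 37 hyperlinks per page, even though the 2014 graph is about half the size of the 2012 graph.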

We hope that the graphs will be useful for

  • researchers who develop search algorithms that rank results based on the hyperlinks between pages.
  • researchers who develop spam detection methods that identify networks of web pages published in order to trick search engines.
  • researchers who develop graph analysis algorithms and can use the hyperlink graphs to test the scalability and performance of their tools.
  • Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.



We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, .properties. All three are necessary to load the network into the library.
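Because all three files must be present, it can be useful to check for them before attempting to load a graph. The following is a minimal sketch using only the Java standard library. It assumes the three files share a common base name followed by the extensions listed above; the class name, method names, and the default base name "example-graph" are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class WebGraphFiles {
    // Extensions of the three files that make up one graph.
    private static final String[] EXTENSIONS = {".graph", ".offsets", ".properties"};

    // File names expected for a given base name (path without extension).
    static List<String> requiredFiles(String baseName) {
        List<String> names = new ArrayList<>();
        for (String ext : EXTENSIONS) {
            names.add(baseName + ext);
        }
        return names;
    }

    // Subset of the expected files that do not exist as regular files.
    static List<String> missingFiles(String baseName) {
        List<String> missing = new ArrayList<>();
        for (String name : requiredFiles(baseName)) {
            if (!Files.isRegularFile(Paths.get(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "example-graph" is a placeholder; pass the real base name as an argument.
        String baseName = args.length > 0 ? args[0] : "example-graph";
        List<String> missing = missingFiles(baseName);
        if (missing.isEmpty()) {
            System.out.println("All three files found for " + baseName);
        } else {
            System.out.println("Missing files: " + missing);
        }
    }
}
```

The check only confirms that the files exist; it does not validate their contents.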

Using the WebGraph Framework, which can be downloaded from Maven Central, these files can be loaded with a single line of code, where baseName is the common path of the three files without the extension:

    BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());

The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.

The Web Data Commons extraction framework can be used under the terms of the Apache Software License.


OPEN ACCESS Dataset Details

Citation Author(s):
Common Crawl Corpus
Submitted by:
Alexander Outman
Last updated:
Wed, 01/18/2017 - 16:36


[1] Common Crawl Corpus, "Web Data Commons - Hyperlink Graphs", IEEE Dataport, 2017. [Online]. Available: http://dx.doi.org/10.21227/H23S3B. Accessed: Oct. 23, 2017.