Consider the size of the web in terms of graph size:
- Thinking of pages as vertices, we have somewhere between 4.77 billion and 50 billion pages (and the count changes substantially from day to day)
- Thinking of links as directed edges, we have roughly 100 billion to 1 trillion links
We have a power law for the fraction of pages that have $k$ links, i.e. $P(k) \sim k^{-\gamma}$
- Lots of pages have 0-1 in-links

Note: with Fibonacci binning, bin sizes grow according to the Fibonacci sequence; this is the binning to use when plotting a degree distribution in log-log space (see the sketch after this list).
- The most common out-degree of a web page is also 1
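Below is a minimal, illustrative sketch of Fibonacci binning in Python. The bin edges are consecutive Fibonacci numbers, and a synthetic power-law-ish degree list stands in for real crawl data; the function names and data are assumptions for illustration, not part of any particular library.

```python
# A minimal sketch of Fibonacci binning: bin edges grow according to the
# Fibonacci sequence, which gives roughly evenly spaced bins on a log-log plot.
import random
from collections import Counter

def fibonacci_bin_edges(max_degree):
    """Return bin edges 1, 2, 3, 5, 8, ... up to (and just past) max_degree."""
    edges = [1, 2]
    while edges[-1] <= max_degree:
        edges.append(edges[-1] + edges[-2])
    return edges

def fibonacci_bin(degrees):
    """Count how many vertices fall into each Fibonacci-sized degree bin."""
    edges = fibonacci_bin_edges(max(degrees))
    counts = Counter()
    for d in degrees:
        # Find the bin [lo, hi) that contains degree d.
        for lo, hi in zip(edges, edges[1:]):
            if lo <= d < hi:
                counts[(lo, hi)] += 1
                break
    return sorted(counts.items())

if __name__ == "__main__":
    # Synthetic power-law-like degrees (illustrative only, not real web data).
    random.seed(0)
    degrees = [max(1, int(random.paretovariate(2.1))) for _ in range(10_000)]
    for (lo, hi), n in fibonacci_bin(degrees):
        print(f"degree in [{lo}, {hi}): {n} pages")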

It is clear that the web graph is big data, so we need Hadoop. We observe that many graph algorithms involve:
- local computations for each node
- propagating these results to neighbors (i.e. graph traversal - think Dijkstra’s for example)
Thus, we must consider:
- Best representation for a graph?
  - An adjacency list maps naturally to key-value pairs: keys are vertex IDs, values are the corresponding adjacency lists (see the sketch after this list)
- How to perform local computations?
  - Local computation ⇒ the work for each vertex is independent, which is easy to model with MapReduce
- How to perform traversal (propagation)?
  - Collecting the messages propagated to a vertex is reduce-like; sending messages to neighbors is the non-local part of the computation
  - If we override the default hash partitioner, we can try to partition vertices so as to minimize the number of inter-partition edges
We will often do a BFS.
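As a concrete (if toy) picture of the points above, here is a plain-Python sketch of the adjacency-list key-value representation and one map/shuffle/reduce pass over it, computing in-degrees by having each vertex send a message to each of its out-neighbors. This simulates the MapReduce pattern rather than being actual Hadoop code; the graph and function names are made up for illustration.

```python
# Key-value view of a graph plus one map/shuffle/reduce pass over it.
from collections import defaultdict

# Adjacency-list representation: key = vertex ID, value = list of out-neighbors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def mapper(vertex, adjacency_list):
    """Local computation: emit one message per out-link (propagation)."""
    for neighbor in adjacency_list:
        yield neighbor, 1          # message to a (possibly remote) neighbor
    yield vertex, 0                # keep the vertex itself in the output

def reducer(vertex, messages):
    """Reduce-like step: collect all messages addressed to this vertex."""
    return vertex, sum(messages)   # here: the vertex's in-degree

# "Shuffle": group mapper output by key, as the framework would.
grouped = defaultdict(list)
for vertex, adj in graph.items():
    for key, value in mapper(vertex, adj):
        grouped[key].append(value)

for vertex in sorted(grouped):
    print(reducer(vertex, grouped[vertex]))   # ('A', 1), ('B', 1), ('C', 3), ('D', 0)
```

The mapper only ever looks at a single key-value pair (the local computation), while the shuffle is what actually moves messages between vertices; that shuffle traffic is why a partitioner that keeps neighboring vertices together can help.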
Single Source Shortest Path
In this problem, we want to find the shortest path from a single source node to all other nodes. Note that “shortest” in this context can refer to fewest links or the lowest total weight.
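One common MapReduce formulation is an iterative parallel BFS: in each pass the mapper pushes a vertex's current best distance to its neighbors, the reducer keeps the minimum distance proposed for each vertex, and the job is rerun until no distance improves. The sketch below simulates this in plain Python with unit edge weights and an illustrative graph; a weighted variant would emit the current distance plus the edge weight instead of distance + 1.

```python
# Plain-Python sketch of iterative MapReduce SSSP with unit weights (parallel BFS).
from collections import defaultdict

INF = float("inf")

graph = {                      # vertex -> adjacency list (illustrative)
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def mapper(vertex, distance, adjacency_list):
    """Emit the vertex's own record plus a candidate distance per neighbor."""
    yield vertex, distance
    if distance < INF:
        for neighbor in adjacency_list:
            yield neighbor, distance + 1   # unit weight; use distance + w(v, n) if weighted

def reducer(vertex, candidates):
    """Keep the shortest distance proposed for this vertex."""
    return vertex, min(candidates)

def sssp(graph, source):
    distances = {v: INF for v in graph}
    distances[source] = 0
    while True:                             # one MapReduce job per iteration
        grouped = defaultdict(list)
        for v, adj in graph.items():
            for key, d in mapper(v, distances[v], adj):
                grouped[key].append(d)
        new_distances = dict(reducer(v, ds) for v, ds in grouped.items())
        if new_distances == distances:      # no distance improved: done
            return distances
        distances = new_distances

print(sssp(graph, "A"))   # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 3}
```

In an actual Hadoop implementation the mapper and reducer must also carry the adjacency list forward in their output so the next iteration can read the graph structure, and the driver (often using counters) decides when to stop iterating.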