Consider the size of the web in terms of graph size:
- Thinking of pages as vertices, we have somewhere between 4.77 billion and 50 billion pages (and the count changes substantially from day to day)
- Thinking of links as directed edges, we have roughly 100 billion to 1 trillion links
We have a power law for the fraction of pages that have $k$ links, i.e. $P(k) \sim k^{-\gamma}$
- Lots of pages have 0-1 in-links

Note: with Fibonacci binning, bin sizes grow according to the Fibonacci sequence; this is the binning to use when plotting a degree distribution in log-log space (see the sketch after this list).
- The most common out-degree of a web page is also 1
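Below is a minimal, illustrative sketch of Fibonacci binning in Python. The bin edges are consecutive Fibonacci numbers, and a synthetic power-law-ish degree list stands in for real crawl data; the function names and data are assumptions for illustration, not part of any particular library.

```python
# A minimal sketch of Fibonacci binning: bin edges grow according to the
# Fibonacci sequence, which gives roughly evenly spaced bins on a log-log plot.
import random
from collections import Counter

def fibonacci_bin_edges(max_degree):
    """Return bin edges 1, 2, 3, 5, 8, ... up to (and just past) max_degree."""
    edges = [1, 2]
    while edges[-1] <= max_degree:
        edges.append(edges[-1] + edges[-2])
    return edges

def fibonacci_bin(degrees):
    """Count how many vertices fall into each Fibonacci-sized degree bin."""
    edges = fibonacci_bin_edges(max(degrees))
    counts = Counter()
    for d in degrees:
        # Find the bin [lo, hi) that contains degree d.
        for lo, hi in zip(edges, edges[1:]):
            if lo <= d < hi:
                counts[(lo, hi)] += 1
                break
    return sorted(counts.items())

if __name__ == "__main__":
    # Synthetic power-law-like degrees (illustrative only, not real web data).
    random.seed(0)
    degrees = [max(1, int(random.paretovariate(2.1))) for _ in range(10_000)]
    for (lo, hi), n in fibonacci_bin(degrees):
        print(f"degree in [{lo}, {hi}): {n} pages")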

It is clear that the web graph is big data, so we need Hadoop. We observe that many graph algorithms involve:
- local computations for each node
- propagating these results to neighbors (i.e. graph traversal - think Dijkstra’s for example)
Thus, we must consider:
- Best representation for a graph?
  - An adjacency list maps naturally to key-value pairs: keys are vertex IDs, values are the corresponding adjacency lists (see the sketch after this list)
- How to perform local computations?
  - Local computation ⇒ the work for each vertex is independent, which is easy to model with MapReduce
- How to perform traversal (propagation)?
  - Collecting the messages propagated to a vertex is reduce-like; sending messages to neighbors is the non-local part of the computation
  - If we override the default hash partitioner, we can try to partition vertices so as to minimize the number of inter-partition edges
We will often do a BFS.
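As a concrete (if toy) picture of the points above, here is a plain-Python sketch of the adjacency-list key-value representation and one map/shuffle/reduce pass over it, computing in-degrees by having each vertex send a message to each of its out-neighbors. This simulates the MapReduce pattern rather than being actual Hadoop code; the graph and function names are made up for illustration.

```python
# Key-value view of a graph plus one map/shuffle/reduce pass over it.
from collections import defaultdict

# Adjacency-list representation: key = vertex ID, value = list of out-neighbors.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def mapper(vertex, adjacency_list):
    """Local computation: emit one message per out-link (propagation)."""
    for neighbor in adjacency_list:
        yield neighbor, 1          # message to a (possibly remote) neighbor
    yield vertex, 0                # keep the vertex itself in the output

def reducer(vertex, messages):
    """Reduce-like step: collect all messages addressed to this vertex."""
    return vertex, sum(messages)   # here: the vertex's in-degree

# "Shuffle": group mapper output by key, as the framework would.
grouped = defaultdict(list)
for vertex, adj in graph.items():
    for key, value in mapper(vertex, adj):
        grouped[key].append(value)

for vertex in sorted(grouped):
    print(reducer(vertex, grouped[vertex]))   # ('A', 1), ('B', 1), ('C', 3), ('D', 0)
```

The mapper only ever looks at a single key-value pair (the local computation), while the shuffle is what actually moves messages between vertices; that shuffle traffic is why a partitioner that keeps neighboring vertices together can help.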
Single Source Shortest Path
In this problem, we want to find the shortest path from a single source node to all other nodes. Note that “shortest” in this context can refer to fewest links or the lowest total weight.
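One common MapReduce formulation is an iterative parallel BFS: in each pass the mapper pushes a vertex's current best distance to its neighbors, the reducer keeps the minimum distance proposed for each vertex, and the job is rerun until no distance improves. The sketch below simulates this in plain Python with unit edge weights and an illustrative graph; a weighted variant would emit the current distance plus the edge weight instead of distance + 1.

```python
# Plain-Python sketch of iterative MapReduce SSSP with unit weights (parallel BFS).
from collections import defaultdict

INF = float("inf")

graph = {                      # vertex -> adjacency list (illustrative)
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def mapper(vertex, distance, adjacency_list):
    """Emit the vertex's own record plus a candidate distance per neighbor."""
    yield vertex, distance
    if distance < INF:
        for neighbor in adjacency_list:
            yield neighbor, distance + 1   # unit weight; use distance + w(v, n) if weighted

def reducer(vertex, candidates):
    """Keep the shortest distance proposed for this vertex."""
    return vertex, min(candidates)

def sssp(graph, source):
    distances = {v: INF for v in graph}
    distances[source] = 0
    while True:                             # one MapReduce job per iteration
        grouped = defaultdict(list)
        for v, adj in graph.items():
            for key, d in mapper(v, distances[v], adj):
                grouped[key].append(d)
        new_distances = dict(reducer(v, ds) for v, ds in grouped.items())
        if new_distances == distances:      # no distance improved: done
            return distances
        distances = new_distances

print(sssp(graph, "A"))   # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 3}
```

In an actual Hadoop implementation the mapper and reducer must also carry the adjacency list forward in their output so the next iteration can read the graph structure, and the driver (often using counters) decides when to stop iterating.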