MapReduce
- programming model for expressing distributed computations on massive amounts of data
- execution framework for large-scale processing on clusters of commodity servers
- widely adopted via open-source implementation called Hadoop
- MapReduce provides an abstraction that hides many system-level details from the programmer
- more efficient to move code to data - spread data across local disks of nodes in a cluster, run processes on nodes that hold the data
- storage management is handled by distributed file system under MapReduce
MapReduce is based on functional programming.
-
higher-order function - function that can accept other function as argument
- e.g. map, fold
- map - apply function $f$
- fold - takes initial value, function $g$ applied to initial value, basically accumulate value
- map - concise way to transform a dataset
- fold - aggregation operation

Map + fold - fold applies function to every element in list, fold iteratively applies function to aggregate a result
- applying a map function $f$ to each item in a list can be easily parallelize (each operation is independent)
- fold - elements must have some data locality (elements need to be “brought together” before folded)
- fold function $g$ doesn’t necessarily need to be applied to all elements in the list ⇒ fold groups can be done in parallel
- for operations that are commutative and associative, can be efficient in fold through local aggregation, reordering
MapReduce essentially corresponds to doing a map + fold operation. As a result, we have a generic “recipe” for processing large datasets:
-
map - user-specified computation over all input records in dataset
- operations occur in parallel
-
reduce - intermediate output is aggregated by another user-specified computation
-
e.g. Word count in MapReduce

MapReduce can refer to:
- programming model (map + reduce/fold)
- execution framework (framework) that coordinates of programs written in this style
- implementation of programming model and execution framework
- e.g. Google proprietary implementation vs. open-source Hadoop
Mappers and Reducers
Key-value pairs are main data structure in MapReduce.