MapReduce Basics

MapReduce

programming model for expressing distributed computations on massive amounts of data
execution framework for large-scale processing on clusters of commodity servers
widely adopted via open-source implementation called Hadoop
MapReduce provides an abstraction that hides many system-level details from the programmer
- it is more efficient to move code to the data: spread data across the local disks of nodes in a cluster, then run processes on the nodes that hold the data
- storage management is handled by the distributed file system that sits underneath MapReduce

MapReduce has its roots in functional programming, specifically the higher-order functions map and fold.
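
As a rough illustration of these functional roots, here is a minimal sketch using Python's built-in `map` and `functools.reduce` (Python's name for fold). The input list and the two lambda functions are made up for the example; they are not part of MapReduce itself.

```python
from functools import reduce

numbers = [1, 2, 3, 4]  # illustrative input

# map: apply a function to every element independently.
# No element depends on another, so this step could run in parallel.
squares = list(map(lambda x: x * x, numbers))

# fold: combine the mapped values into a single result,
# starting from an initial accumulator value of 0.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

The map step has no shared state, which is what makes it easy to distribute; the fold step is where the results are brought together.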

MapReduce essentially corresponds to doing a map + fold operation. As a result, we have a generic “recipe” for processing large datasets:

map - user-specified computation over all input records in dataset
- operations occur in parallel
reduce - intermediate output is aggregated by another user-specified computation
e.g. Word count in MapReduce
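
The word count example can be sketched in plain Python as a single-process simulation. This is only an illustration under stated assumptions, not Hadoop's API: the names `map_fn`, `reduce_fn` and `word_count` are hypothetical, the sample documents are made up, and the grouping step in the middle stands in for work a real framework would do across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Hypothetical mapper: emit a (word, 1) pair for every word in one record."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Hypothetical reducer: aggregate all counts emitted for one word."""
    return word, sum(counts)

def word_count(records):
    # Map phase: each record is processed independently, so a real
    # framework could run these calls in parallel on different nodes.
    intermediate = []
    for doc_id, text in records:
        intermediate.extend(map_fn(doc_id, text))

    # Group intermediate values by key (handled by the framework in practice).
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: aggregate the values collected for each key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())

# Illustrative input: (document id, text) records.
docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
counts = word_count(docs)
print(counts["the"])  # 3
```

Only `map_fn` and `reduce_fn` correspond to the user-specified computations; everything inside `word_count` is the part an execution framework would take care of.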

MapReduce can refer to:

programming model (map + reduce/fold)
execution framework (runtime) that coordinates the execution of programs written in this style
implementation of programming model and execution framework
- e.g. Google proprietary implementation vs. open-source Hadoop

Mappers and Reducers

Key-value pairs are the basic data structure in MapReduce.