Several design patterns recur across MapReduce algorithms.
The only synchronization in MapReduce occurs during the shuffle/sort stage, when intermediate key-value pairs are copied from mappers to reducers and grouped by key.
- Otherwise, mappers/reducers run in isolation without mechanisms for direct control
- Programmer doesn’t control many aspects of execution, e.g.
- where mapper/reducer runs
- when mapper/reducer finishes
- which input key-value pairs are processed by a specific mapper
- which intermediate key-value pairs are processed by a specific reducer
- However, the programmer does have techniques for controlling execution and managing the flow of data in MapReduce:
- can construct complex data structures as keys/values in order to store/communicate partial results
- can execute initialization code at beginning of map/reduce task
- can execute termination code at end of map/reduce task
- can preserve state in mappers/reducers across multiple input or intermediate keys
- can control sort order of intermediate keys ⇒ order in which reducer encounters particular keys
- can control partitioning of key space ⇒ can control set of keys that a reducer will encounter
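Several of the techniques above (initialization code, termination code, and preserved state across inputs) combine into the in-mapper combining pattern. The sketch below simulates it in plain Python rather than the actual Hadoop API; the class and method names (`setup`, `map`, `cleanup`) are illustrative, chosen to mirror the hooks Hadoop-style frameworks provide:

```python
from collections import defaultdict

class WordCountMapper:
    """Simulated mapper that preserves state across input records
    (in-mapper combining): counts are buffered and emitted once."""

    def setup(self):
        # initialization code, run once before any input is processed
        self.counts = defaultdict(int)

    def map(self, key, value):
        # accumulate partial results instead of emitting one pair per word
        for word in value.split():
            self.counts[word] += 1

    def cleanup(self, emit):
        # termination code: emit the aggregated pairs once per task
        for word, count in self.counts.items():
            emit(word, count)

# usage: feed a few "input records" through one mapper task
mapper = WordCountMapper()
mapper.setup()
for offset, line in enumerate(["a b a", "b c"]):
    mapper.map(offset, line)
pairs = []
mapper.cleanup(lambda k, v: pairs.append((k, v)))
print(sorted(pairs))  # [('a', 2), ('b', 2), ('c', 1)]
```

Buffering in the mapper reduces the number of intermediate key-value pairs that must be shuffled, at the cost of holding state in memory for the lifetime of the task.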
Many algorithms cannot easily be expressed as a single MapReduce job and must be decomposed into a sequence of jobs.
- Iterative algorithms - need repeated execution until some convergence criterion is met, which cannot easily be expressed within MapReduce itself
	- solution - an external program serves as a "driver" that submits jobs and controls the iterations
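The driver idea can be sketched as a plain loop that submits one job per iteration and checks convergence between jobs. This is a simulation: `run_mapreduce_job` here is a hypothetical stand-in for launching a real job, and the fixed-point computation inside it is purely illustrative:

```python
def run_mapreduce_job(state):
    # stand-in for submitting one MapReduce job and collecting its output;
    # this toy "job" performs one step toward the fixed point x = 10
    return (state + 10.0) / 2


def driver(initial_state, tolerance=1e-6, max_iterations=100):
    """External driver program: runs one job per iteration and stops
    once the convergence criterion (small change in state) is met."""
    state = initial_state
    for i in range(max_iterations):
        new_state = run_mapreduce_job(state)
        if abs(new_state - state) < tolerance:
            return new_state, i + 1
        state = new_state
    return state, max_iterations


result, iterations = driver(0.0)
print(round(result, 3))  # converges to 10.0
```

In practice each iteration's output is written to the distributed file system and becomes the next iteration's input; the driver only inspects a small summary (e.g. a counter) to decide whether to stop.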
We want MapReduce algorithms to be scalable (able to handle larger datasets) and efficient (no wasted resources); linear scalability is the ideal.
Local Aggregation (In-mapper combining)
Pairs & Stripes