Local Aggregation (In-mapper combining)

Network and disk latencies are expensive compared to other operations ⇒ reducing the amount of intermediate data transferred increases algorithmic efficiency.

Combiners may substantially reduce the number and size of key-value pairs that need to be shuffled from mappers to reducers.

Combiners are one mechanism for local aggregation.

Combiners reduce the number of intermediate key-value pairs that need to be shuffled across the network
- at best, each mapper's output is reduced to one pair per unique key
Extending this idea, we can aggregate all counts within a mapper and emit all pairs in cleanup - this is known as “in-mapper combining”
e.g. In-mapper combining, word count:
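A minimal sketch of in-mapper combining for word count, written in plain Python rather than against a real MapReduce API. The class `WordCountMapper` and the driver `run_mapper` are hypothetical names; they only mimic the setup/map/cleanup lifecycle of a mapper so the aggregation pattern is visible.

```python
from collections import defaultdict


class WordCountMapper:
    """Hypothetical mapper that mimics a setup/map/cleanup lifecycle."""

    def setup(self):
        # State that persists across all map() calls for this input split.
        self.counts = defaultdict(int)

    def map(self, key, value):
        # Aggregate locally instead of emitting (word, 1) for every token.
        for word in value.split():
            self.counts[word] += 1

    def cleanup(self):
        # Emit one pair per unique word once the whole split is processed.
        yield from self.counts.items()


def run_mapper(mapper, split):
    """Hypothetical driver: feeds every record of a split to the mapper."""
    mapper.setup()
    for key, value in split:
        mapper.map(key, value)
    return list(mapper.cleanup())


# Two input records (offset, line) containing six tokens in total.
split = [(0, "a b a"), (1, "b a c")]
pairs = run_mapper(WordCountMapper(), split)
```

With this input, a mapper that emitted `(word, 1)` per token would materialize six pairs; the sketch emits three, one per unique word.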

Advantages of In-Mapper Combining

We control when local aggregation occurs and exactly how it takes place, whereas with a combiner we don’t know whether it will be applied at all
Typically more efficient than using actual combiners
- less overhead in actually materializing the key-value pairs (regular combiners don’t reduce the number of key-value pairs emitted by mappers in the first place)

Drawbacks of In-Mapper Combining

Breaks the functional programming underpinnings of MapReduce (the mapper preserves state)
- not a big deal, need to be pragmatic
Preserving state across multiple input key-value pairs means algorithmic behavior may depend on the order in which those pairs are encountered ⇒ potential for order-dependent bugs
Scalability bottleneck - we depend on having sufficient memory to store intermediate results until mapper has completely processed all key-value pairs in input split
- i.e. we only scale to a point where results can be held in memory
- possible solution to limit memory usage is to “block” new key-value pairs and “flush” in-memory data structures periodically (i.e. emit partial results after processing every $n$ KVPs)
- must determine flush size empirically; difficult to coordinate because multiple tasks may be running concurrently and competing for memory
- often we have diminishing returns from increasing buffer size ⇒ not worth effort to search for optimal buffer size
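The block-and-flush idea above can be sketched as follows, again in plain Python rather than a real MapReduce API. `FlushingWordCountMapper`, `run_flushing_mapper`, and the `flush_every` parameter are hypothetical names; `flush_every` plays the role of $n$, and its value here is arbitrary.

```python
from collections import defaultdict


class FlushingWordCountMapper:
    """Hypothetical in-mapper combiner that flushes every `flush_every` records."""

    def __init__(self, flush_every):
        self.flush_every = flush_every  # the n in "every n KVPs" (assumed knob)

    def setup(self):
        self.counts = defaultdict(int)
        self.seen = 0

    def map(self, key, value):
        for word in value.split():
            self.counts[word] += 1
        self.seen += 1
        # Emit partial counts periodically so memory use stays bounded.
        if self.seen % self.flush_every == 0:
            yield from self._flush()

    def cleanup(self):
        # Emit whatever is left after the last record.
        yield from self._flush()

    def _flush(self):
        pairs = list(self.counts.items())
        self.counts.clear()
        return pairs


def run_flushing_mapper(mapper, split):
    """Hypothetical driver: collects everything the mapper emits for a split."""
    mapper.setup()
    emitted = []
    for key, value in split:
        emitted.extend(mapper.map(key, value))
    emitted.extend(mapper.cleanup())
    return emitted


split = [(0, "a b a"), (1, "b a c"), (2, "a")]
emitted = run_flushing_mapper(FlushingWordCountMapper(flush_every=2), split)

# The reducer still sums partial counts per key, so totals are unchanged.
totals = defaultdict(int)
for word, count in emitted:
    totals[word] += count
```

A key can now be emitted more than once per mapper (here `a` appears in both the periodic flush and the final one), which trades some shuffle volume for a bounded memory footprint; the reducer's sum makes the result identical.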

Extent to which efficiency can be increased through local aggregation depends on: