Network/disk latencies are expensive compared to other operations ⇒ reducing the amount of intermediate data transferred increases algorithmic efficiency.
- Combiners may substantially reduce the number and size of key-value pairs that need to be shuffled from mappers to reducers
Combiners are one mechanism for local aggregation.
- combiners reduce the number of intermediate key-value pairs that need to be shuffled across the network
- at best, each mapper's output shrinks to one pair per unique key it encountered
- Extending this idea, a mapper can aggregate all counts in memory and emit the pairs only in its cleanup step; this is known as “in-mapper combining”
- e.g. In-mapper combining, word count:
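
A minimal sketch of in-mapper combining for word count, written in plain Python rather than against the Hadoop API. The class `InMapperWordCount`, its `map`/`cleanup` methods, and the driver `run_mapper` are hypothetical stand-ins for a mapper's per-record and end-of-task hooks, not names from any real framework.

```python
from collections import Counter


class InMapperWordCount:
    """Hypothetical mapper: counts accumulate in memory across map() calls."""

    def __init__(self):
        self.counts = Counter()  # state preserved across input records

    def map(self, doc_id, text):
        # Nothing is emitted here; counts are only accumulated locally.
        for word in text.split():
            self.counts[word] += 1

    def cleanup(self):
        # Emit one (word, count) pair per distinct word seen by this mapper.
        yield from self.counts.items()


def run_mapper(records):
    """Hypothetical driver: feed every record of one input split to one mapper."""
    mapper = InMapperWordCount()
    for doc_id, text in records:
        mapper.map(doc_id, text)
    return list(mapper.cleanup())


records = [(1, "a b a"), (2, "b c a")]
pairs = run_mapper(records)
print(sorted(pairs))  # [('a', 3), ('b', 2), ('c', 1)]
```

A mapper that emitted `(word, 1)` for every token would produce six pairs for this input; here only three are emitted, one per distinct word.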

Advantages of In-Mapper Combining
- we control when local aggregation occurs and exactly how it takes place, vs. a combiner, where we don’t know whether it will be applied at all
- Typically more efficient than using actual combiners
- less overhead from materializing key-value pairs (regular combiners don’t reduce the number of key-value pairs emitted by mappers in the first place)
Drawbacks of In-Mapper Combining
- breaks functional programming underpinnings of MapReduce (preserving state)
- not a big deal, need to be pragmatic
- Preserving state across input key-value pairs means algorithmic behavior may depend on the order in which they are encountered ⇒ potential for order-dependent bugs
- Scalability bottleneck: we depend on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in its input split
- i.e. we only scale to a point where results can be held in memory
- possible solution to limit memory usage is to “block” input key-value pairs and “flush” in-memory data structures periodically (i.e. emit partial results after processing every $n$ key-value pairs)
- flush size must be determined empirically, and is difficult to tune because the memory available to a task depends on how many tasks are running concurrently on the same node
- often we have diminishing returns from increasing buffer size ⇒ not worth effort to search for optimal buffer size
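
The block-and-flush idea above can be sketched as follows. This is an illustration under assumptions, not a framework API: `map_with_flush` and its `flush_every` parameter (the $n$ above) are hypothetical names, and the input is simplified to a flat list of words.

```python
from collections import Counter


def map_with_flush(words, flush_every):
    """Yield (word, partial_count) pairs, flushing after every `flush_every` inputs."""
    counts = Counter()
    seen = 0
    for word in words:
        counts[word] += 1
        seen += 1
        if seen % flush_every == 0:
            yield from counts.items()  # emit partial results
            counts.clear()             # release the memory held so far
    yield from counts.items()          # final flush, as cleanup would do


words = "a b a c a b a a".split()
for n in (1, 2, 4, 8):
    # Number of pairs emitted for each flush interval.
    print(n, len(list(map_with_flush(words, n))))
# 1 8
# 2 7
# 4 5
# 8 3
```

Reducers still sum the partial counts, so the result is unchanged; a smaller flush interval only trades fewer entries held in memory for more pairs shuffled.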
Extent to which efficiency can be increased through local aggregation depends on:
- size of intermediate key space
- distribution of keys
- number of key-value pairs emitted by each individual map task
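
A toy illustration of how key distribution affects the benefit, assuming the keys produced by one map task are given as a list. The helper `emitted_pairs` and both key lists are made up for this example.

```python
def emitted_pairs(keys):
    """Return (pairs without aggregation, pairs with full local aggregation)."""
    return len(keys), len(set(keys))


skewed = ["the"] * 90 + [f"w{i}" for i in range(10)]  # one key dominates
uniform = [f"w{i}" for i in range(100)]               # every key is distinct

print(emitted_pairs(skewed))   # (100, 11)
print(emitted_pairs(uniform))  # (100, 100)
```

With the skewed keys, local aggregation cuts 100 pairs to 11; with all-distinct keys there is nothing to merge, so the task pays the bookkeeping cost for no reduction.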