Pairs & Stripes

A common approach to synchronization is to construct keys and values so that the data needed for a computation is naturally brought together by the execution framework.

A typical problem is counting co-occurrences, i.e. how often each pair of values appears together.

Pairs - each key is a pair of ids whose co-occurrence is counted; the value is the count.

Pros
- values tend to be simpler
- easy to implement/understand
Cons
- large number of key-value pairs (quadratic)
- combiners don’t help much
  - $N \times N$ potential keys - most keys have few entries, so not many cases where combiner helps
e.g. word co-occurrence
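
A minimal sketch of the pairs approach for word co-occurrence, in plain Python rather than a real MapReduce framework. The names `pairs_map`, `pairs_reduce` and `run_pairs` are hypothetical, the shuffle phase is simulated with an in-process dictionary, and "co-occurring" is assumed to mean "appearing in the same document".

```python
from collections import defaultdict


def pairs_map(words):
    """Mapper: emit ((w, u), 1) for every ordered pair of distinct positions."""
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield (w, u), 1


def pairs_reduce(pair, counts):
    """Reducer: sum the partial counts for one pair."""
    yield pair, sum(counts)


def run_pairs(docs):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for words in docs:
        for key, value in pairs_map(words):
            groups[key].append(value)

    result = {}
    for key in sorted(groups):  # the framework would deliver keys sorted
        for pair, total in pairs_reduce(key, groups[key]):
            result[pair] = total
    return result
```

Each value is a single integer, which is why the approach is easy to implement, but every co-occurrence becomes its own intermediate record.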

Stripes - each key is a single id; the value is a map from every co-occurring id to its count.

Pros
- fewer key-value pairs than the pairs approach, with fewer and shorter intermediate keys (less sorting)
- combiners can do more work (more likely to have same key)
Cons
- values are more complex (serialization + deserialization overhead)
- map may not fit in memory
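
The same computation under the stripes approach might look as follows. This is again a sketch under assumptions: `stripes_map`, `stripes_reduce` and `run_stripes` are hypothetical names, the shuffle is simulated in-process, and co-occurrence is assumed to mean "same document".

```python
from collections import Counter, defaultdict


def stripes_map(words):
    """Mapper: emit one (word, stripe) record per distinct word in the document."""
    stripes = defaultdict(Counter)
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                stripes[w][u] += 1  # the stripe maps neighbour -> count
    for w, stripe in stripes.items():
        yield w, stripe


def stripes_reduce(word, stripes):
    """Reducer: element-wise sum of all stripes that share a key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts
    yield word, total


def run_stripes(docs):
    """Simulate map -> shuffle (group by key) -> reduce in a single process."""
    groups = defaultdict(list)
    for words in docs:
        for key, stripe in stripes_map(words):
            groups[key].append(stripe)

    result = {}
    for key in sorted(groups):
        for word, total in stripes_reduce(key, groups[key]):
            result[word] = total
    return result
```

The number of intermediate keys is bounded by the vocabulary size, but each value is a whole map that has to be held in memory and serialized.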

Both algorithms benefit from combiners, since the reducer operations (addition for pairs, element-wise sum of maps for stripes) are commutative and associative.

Combiners with stripes have more opportunities to perform local aggregation - the key space is the vocabulary.
Pairs offer less opportunity - a combiner only helps when the exact same pair is encountered again.
- this also limits opportunities for in-memory combining - the mapper can run out of memory storing partial counts
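
To make the difference in combining opportunities concrete, the sketch below counts how many records a single mapper emits before and after local combining under each approach. The function names are hypothetical, co-occurrence is assumed to mean "same document", and a perfect combiner is assumed to leave exactly one record per distinct key.

```python
def cooccurring(words):
    """Yield every ordered pair of words at distinct positions in a document."""
    for i, w in enumerate(words):
        for j, u in enumerate(words):
            if i != j:
                yield w, u


def combiner_savings(docs):
    """Return (records before combining, records after) for pairs and stripes.

    `docs` stands for the input split of a single mapper.
    """
    pairs_before, pair_keys = 0, set()
    stripes_before, stripe_keys = 0, set()

    for words in docs:
        emitted = list(cooccurring(words))

        # Pairs: one record per co-occurrence, keyed by the exact pair.
        pairs_before += len(emitted)
        pair_keys.update(emitted)

        # Stripes: one record per distinct word that has a neighbour.
        stripe_words = {w for w, _ in emitted}
        stripes_before += len(stripe_words)
        stripe_keys.update(stripe_words)

    return {
        "pairs": (pairs_before, len(pair_keys)),
        "stripes": (stripes_before, len(stripe_keys)),
    }
```

On the toy input `[["a", "b", "c"], ["a", "b", "d"], ["a", "c", "d"]]`, pairs goes from 18 records to 12, while stripes goes from 9 records to 4. Stripes keys repeat more often, so a larger share of records can be merged locally.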

In terms of scalability:

Stripes assumes that each map (stripe) is small enough to fit in memory; otherwise memory paging will significantly degrade performance.