Details on Reduce vs ReduceByKey vs CombineByKey vs AggregateByKey vs GroupByKey

Avoid using groupByKey if possible: like the old MapReduce process, it shuffles every value across the network without combining anything on the map side first.

reduceByKey
- Less flexible, but covers what a reduce (MapReduce) is normally used for
- Unlike the reduce of MapReduce, values are combined before the shuffle
  - Partitions the RDD, reduces each partition, then shuffles the partial results for the final reduce
  - Optional final parameter - number of partitions
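
The difference in shuffle volume can be sketched with a small plain-Python model. This is not Spark code: `group_by_key_model` and `reduce_by_key_model` are hypothetical helpers, partitions are modelled as plain lists, and "shuffled" simply counts the records that would cross the network under these assumptions.

```python
from collections import defaultdict

def group_by_key_model(partitions):
    """Ship every (key, value) record through the shuffle, then group by key."""
    shuffled = [record for part in partitions for record in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)

def reduce_by_key_model(partitions, func):
    """Reduce inside each partition first, then shuffle only the partial results."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    # Final reduce over the partial results that arrived from each partition.
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)

# Two partitions of (key, value) records (made-up data).
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped, grouped_shuffled = group_by_key_model(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
reduced, reduced_shuffled = reduce_by_key_model(partitions, lambda x, y: x + y)

print(sums_from_groups, grouped_shuffled)
print(reduced, reduced_shuffled)
```

Both routes give the same sums, but in this model the grouping route moves all six records while the reducing route moves only one partial result per key per partition.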

combineByKey
- Provides more fine-grained control when needed
- Note that reduceByKey is simply combineByKey(identity, reduce, reduce)

aggregateByKey
- Midpoint between reduceByKey and combineByKey
- A zero value is provided instead of an initializer function (createCombiner)
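
How the three operations relate can be sketched in plain Python. This is a model of the semantics, not Spark's implementation: all function names below are hypothetical, partitions are plain lists, and copying the zero value per key is an assumption made so that a mutable zero would not be shared between keys.

```python
import copy

def combine_by_key_model(partitions, create_combiner, merge_value, merge_combiners):
    """Build one combiner per key inside each partition, then merge across partitions."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        partials.append(local)
    result = {}
    for local in partials:
        for key, combiner in local.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result

def reduce_by_key_model(partitions, func):
    # combineByKey(identity, reduce, reduce)
    return combine_by_key_model(partitions, lambda v: v, func, func)

def aggregate_by_key_model(partitions, zero, seq_func, comb_func):
    # The zero value replaces the initializer: the first value is folded into a copy of it.
    return combine_by_key_model(
        partitions,
        lambda v: seq_func(copy.deepcopy(zero), v),
        seq_func,
        comb_func,
    )

# Made-up data: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 6)],
]

sums = reduce_by_key_model(partitions, lambda x, y: x + y)

# Per-key (sum, count): the result type differs from the value type,
# which reduceByKey alone cannot express.
sum_count = aggregate_by_key_model(
    partitions,
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {key: total / count for key, (total, count) in sum_count.items()}
print(sums, sum_count, averages)
```

The per-key average is the usual motivation for the extra flexibility: the accumulator is a (sum, count) pair rather than a single value.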

If the reduce function needs to know which key it is working on (i.e. some keys must be treated differently), groupByKey followed by a map or mapPartitions may be needed
- Maybe just carry the key inside a compound value in this case, so that reduceByKey can still be used
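
One reading of the compound-value idea, sketched in plain Python rather than Spark: each value is wrapped as (key, value) so the merge function can see the key. The helper names, the keys "peak" and "total", and the data are all made up for illustration, and partitions are left out to keep the sketch short.

```python
def reduce_by_key_model(records, func):
    """Fold all values that share a key with a two-argument merge function."""
    result = {}
    for key, value in records:
        result[key] = func(result[key], value) if key in result else value
    return result

def key_aware_merge(left, right):
    """Merge two compound values; both carry the same key in position 0."""
    key = left[0]
    if key == "peak":
        # This key keeps the maximum instead of the sum.
        return (key, max(left[1], right[1]))
    return (key, left[1] + right[1])

records = [("peak", 3), ("total", 10), ("peak", 7), ("total", 4), ("peak", 5)]

# Wrap each value so the key travels with it into the merge function.
compound = [(key, (key, value)) for key, value in records]
merged = reduce_by_key_model(compound, key_aware_merge)

# Unwrap the compound values again.
result = {key: value for key, (_, value) in merged.items()}
print(result)
```

The merge function still has to be associative and commutative for every key, which holds here for both max and addition.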

Note that we can also repartition - this triggers a full shuffle, but we can get more balanced partitions.

repartition - can either increase or decrease the number of partitions
coalesce - should only be used to reduce the number of partitions
- Avoids full shuffle (faster than repartition)
- May give unbalanced partitions
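
The balance trade-off can be sketched with a toy plain-Python model. This is not how Spark chooses its groupings: merging neighbouring partitions and dealing records round-robin are simplifying assumptions, and both function names are hypothetical.

```python
def coalesce_model(partitions, n):
    """Merge whole neighbouring partitions into n groups; records are never split up."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition_model(partitions, n):
    """Full shuffle: deal every record round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    records = [record for part in partitions for record in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out

# Four skewed partitions with sizes 10, 1, 1, 1 (made-up data).
partitions = [list(range(10)), [10], [11], [12]]

coalesced = coalesce_model(partitions, 2)
repartitioned = repartition_model(partitions, 2)

coalesced_sizes = [len(part) for part in coalesced]
repartitioned_sizes = [len(part) for part in repartitioned]
print(coalesced_sizes, repartitioned_sizes)
```

In this model the coalesced result inherits the skew of the input because partitions are only glued together, while the full shuffle evens the sizes out at the cost of moving every record.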