Avoid using groupByKey if possible: it shuffles every value across the network with no map-side combine, much like the old MapReduce process.
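The cost difference can be illustrated with a small pure-Python model. This is only a sketch, not Spark code: partitions are modelled as plain lists, and the two helper functions (hypothetical names) just count how many records would cross the network under each strategy.

```python
# Toy model: each inner list is one partition of (key, value) records.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def shuffled_records_group_by_key(parts):
    # groupByKey ships every record to the reducer side unchanged.
    return sum(len(p) for p in parts)

def shuffled_records_reduce_by_key(parts, func):
    # reduceByKey combines values per key inside each partition first,
    # so at most one record per (partition, key) pair is shuffled.
    total = 0
    for p in parts:
        local = {}
        for k, v in p:
            local[k] = func(local[k], v) if k in local else v
        total += len(local)
    return total

group_cost = shuffled_records_group_by_key(partitions)
reduce_cost = shuffled_records_reduce_by_key(partitions, lambda x, y: x + y)
print(group_cost, reduce_cost)
```

The gap widens as the number of records per key within a partition grows, which is why the combining variants are usually preferred for aggregations.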
reduceByKey
combineByKey
reduceByKey is simply combineByKey(identity, reduce, reduce)
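The identity above can be checked with a pure-Python model of combineByKey. This is a sketch of the semantics only, assuming partitions are plain lists of (key, value) pairs; `combine_by_key` and `reduce_by_key` are illustrative helpers, not the Spark API itself.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey: combine within partitions, then across them."""
    per_partition = []
    for part in partitions:
        local = {}
        for k, v in part:
            if k in local:
                local[k] = merge_value(local[k], v)   # fold value into combiner
            else:
                local[k] = create_combiner(v)         # first value seen for key
        per_partition.append(local)
    result = {}
    for local in per_partition:
        for k, c in local.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

def reduce_by_key(partitions, func):
    # The same function serves as merge_value and merge_combiners.
    return combine_by_key(partitions, lambda v: v, func, func)

partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
sums = reduce_by_key(partitions, lambda x, y: x + y)

# combineByKey is more general: the combiner may have a different type
# from the values, here a (sum, count) pair.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda c, v: (c[0] + v, c[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(sums, sum_count)
```

reduceByKey works only when the result has the same type as the values; the (sum, count) case needs the full combineByKey form.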
aggregateByKey
Midpoint between reduceByKey and combineByKey
Zero value is provided instead of an initialize function
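A minimal sketch of how the zero value stands in for the initialize function, again as a pure-Python model rather than Spark itself. The helper names are illustrative; the zero value is deep-copied per key on the assumption that it may be mutable.

```python
import copy

def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey over a list of partitions."""
    result = {}
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = merge_value(local[k], v) if k in local else create_combiner(v)
        for k, c in local.items():
            result[k] = merge_combiners(result[k], c) if k in result else c
    return result

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # The "initialize" step becomes: fold the first value into a fresh zero.
    return combine_by_key(
        partitions,
        lambda v: seq_op(copy.deepcopy(zero), v),
        seq_op,
        comb_op,
    )

partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
agg = aggregate_by_key(
    partitions,
    (0, 0),                                     # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # fold a value into the accumulator
    lambda x, y: (x[0] + y[0], x[1] + y[1]),    # merge accumulators across partitions
)
means = {k: s / n for k, (s, n) in agg.items()}
print(agg, means)
```

Compared with combineByKey, the caller supplies a constant instead of a function of the first value, which is often simpler when the accumulator has a natural empty state.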
Rewriting groupByKey into a map or mapPartitions may be needed
Note that we can also repartition - this triggers shuffling, but we can get more balanced partitions.
coalesce - should only be used to reduce the number of partitions; unlike repartition, it avoids a full shuffle
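The trade-off between the two can be sketched with a toy model. This is an assumption-laden simplification, not Spark's actual placement logic: `coalesce` here merges whole neighbouring partitions, and `repartition` deals records out round-robin to stand in for a full shuffle.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; individual records are not redistributed."""
    if n >= len(partitions):
        # Without a shuffle, coalesce cannot increase the partition count.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every record round-robin (models a full shuffle)."""
    out = [[] for _ in range(n)]
    records = [r for p in partitions for r in p]
    for i, r in enumerate(records):
        out[i % n].append(r)
    return out

skewed = [list(range(8)), [8], [9], [10]]   # one large partition, three tiny ones
coalesced = coalesce(skewed, 2)
rebalanced = repartition(skewed, 2)
print([len(p) for p in coalesced], [len(p) for p in rebalanced])
```

In this model coalesce is cheap but inherits the skew of its inputs, while repartition pays for moving every record and ends up with evenly sized partitions.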