RDD
An RDD is a resilient distributed dataset. An RDD[T] is a collection of values of type T.
RDDs are divided into “partitions”, which workers operate on independently.
RDDs can be made in a variety of ways:
sc.parallelize
- SparkContext method; turns any traversable collection into an RDD.

There are a variety of operations that can be performed on RDDs, similar to MapReduce but with more variety.
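To make the partitioning idea concrete, here is a minimal plain-Python sketch (not actual Spark code; the function name and partition count are illustrative) of how parallelize might split a traversable into partitions that workers then process independently:

```python
# Sketch only: emulate splitting a collection into roughly equal
# contiguous partitions, as sc.parallelize conceptually does.
def parallelize(data, num_partitions):
    data = list(data)
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # Early partitions absorb the remainder, one extra element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

rdd = parallelize(range(10), 3)
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In the sketches below, an "RDD" is represented as this list of per-partition lists.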
Map-Like Operations
Map-like operations are those that take an RDD and return another RDD.
map
- Function $f$ takes one value of type T and outputs one value of type U.
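A minimal pure-Python sketch (not Spark itself) of the semantics, with the RDD modeled as a list of partitions: each worker applies $f$ to every element of its partition, so the output has the same number of elements as the input.

```python
# Sketch: map applies f element-wise within each partition.
def rdd_map(partitions, f):
    return [[f(x) for x in part] for part in partitions]

doubled = rdd_map([[1, 2], [3, 4]], lambda x: x * 2)
# → [[2, 4], [6, 8]]
```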
filter
- Function $f$ is a predicate on values of type T; only values for which $f$ returns true appear in the output RDD.
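Sketched the same way (plain Python, not Spark): elements failing the predicate are dropped, so partitions may shrink.

```python
# Sketch: filter keeps only elements for which pred returns True.
def rdd_filter(partitions, pred):
    return [[x for x in part if pred(x)] for part in partitions]

evens = rdd_filter([[1, 2], [3, 4]], lambda x: x % 2 == 0)
# → [[2], [4]]
```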
flatMap
- Function $f$ takes one value of type T and outputs an iterable collection of values of type U; the results are flattened into the output RDD.
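A pure-Python sketch of the flattening behavior (not Spark itself): each element maps to zero or more outputs, which are concatenated within the partition.

```python
# Sketch: flatMap applies f, which returns an iterable per element,
# then flattens the results within each partition.
def rdd_flat_map(partitions, f):
    return [[y for x in part for y in f(x)] for part in partitions]

words = rdd_flat_map([["a b"], ["c d e"]], str.split)
# → [["a", "b"], ["c", "d", "e"]]
```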
mapPartitions
- Function $f$ is given an iterator that produces values of type T, and returns an iterator/iterable collection that produces values of type U.
Each worker calls $f$ once on an iterator that traverses all items in the worker's partition of the input RDD. Each worker then traverses the returned iterable, adding each value to the output RDD.
Useful for per-partition setup/cleanup, analogous to setup/cleanup in MapReduce; e.g. when some common computation is needed once per partition rather than once per element.
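The per-partition contract can be sketched in plain Python (not Spark itself; add_offset and its offset are hypothetical stand-ins for expensive per-partition setup): $f$ receives one iterator per partition and yields the partition's outputs, so setup inside $f$ runs once per partition, not once per element.

```python
# Sketch: mapPartitions calls f once per partition with an iterator
# over that partition's elements; f returns an iterable of outputs.
def rdd_map_partitions(partitions, f):
    return [list(f(iter(part))) for part in partitions]

def add_offset(it):
    offset = 100  # stand-in for expensive setup, done once per partition
    for x in it:
        yield x + offset

out = rdd_map_partitions([[1, 2], [3]], add_offset)
# → [[101, 102], [103]]
```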