RDD
An RDD is a resilient distributed dataset. An RDD[T] is a collection of values of type T.
RDDs are divided into “partitions”, which workers operate on independently.
RDDs can be made in a variety of ways:
sc.parallelize
- SparkContext method; turns any traversable collection into an RDD.

There are a variety of operations that can be performed on RDDs, similar to MapReduce but with more variety.
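To make the partitioning idea concrete, here is a minimal plain-Python sketch (not actual Spark code; the function name and partition count are illustrative) of how parallelize might split a traversable into partitions that workers then process independently:

```python
# Sketch only: emulate splitting a collection into roughly equal
# contiguous partitions, as sc.parallelize conceptually does.
def parallelize(data, num_partitions):
    data = list(data)
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # Early partitions absorb the remainder, one extra element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

rdd = parallelize(range(10), 3)
# → [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In the sketches below, an "RDD" is represented as this list of per-partition lists.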
Map-Like Operations
Map-like operations are those that take an RDD and return another RDD.
map
- Function $f$ takes one value of type T and outputs one value of type U.
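A minimal pure-Python sketch (not Spark itself) of the semantics, with the RDD modeled as a list of partitions: each worker applies $f$ to every element of its partition, so the output has the same number of elements as the input.

```python
# Sketch: map applies f element-wise within each partition.
def rdd_map(partitions, f):
    return [[f(x) for x in part] for part in partitions]

doubled = rdd_map([[1, 2], [3, 4]], lambda x: x * 2)
# → [[2, 4], [6, 8]]
```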
filter
- Function $f$ is a predicate on values of type T; only values for which $f$ returns true appear in the output RDD.
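Sketched the same way (plain Python, not Spark): elements failing the predicate are dropped, so partitions may shrink.

```python
# Sketch: filter keeps only elements for which pred returns True.
def rdd_filter(partitions, pred):
    return [[x for x in part if pred(x)] for part in partitions]

evens = rdd_filter([[1, 2], [3, 4]], lambda x: x % 2 == 0)
# → [[2], [4]]
```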
flatMap
- Function $f$ takes one value of type T and outputs an iterable collection of values of type U; the results are flattened into the output RDD.
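A pure-Python sketch of the flattening behavior (not Spark itself): each element maps to zero or more outputs, which are concatenated within the partition.

```python
# Sketch: flatMap applies f, which returns an iterable per element,
# then flattens the results within each partition.
def rdd_flat_map(partitions, f):
    return [[y for x in part for y in f(x)] for part in partitions]

words = rdd_flat_map([["a b"], ["c d e"]], str.split)
# → [["a", "b"], ["c", "d", "e"]]
```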
mapPartitions
- Function $f$ is given an iterator that produces values of type T, and returns an iterator/iterable collection that produces values of type U.
Each worker calls $f$ once on an iterator that traverses all items in the worker's partition of the input RDD. Each worker then traverses the returned iterable, adding each value to the output RDD.
Useful for per-partition setup/cleanup, analogous to setup/cleanup in MapReduce; e.g. when some common computation is needed once per partition rather than once per element.
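The per-partition contract can be sketched in plain Python (not Spark itself; add_offset and its offset are hypothetical stand-ins for expensive per-partition setup): $f$ receives one iterator per partition and yields the partition's outputs, so setup inside $f$ runs once per partition, not once per element.

```python
# Sketch: mapPartitions calls f once per partition with an iterator
# over that partition's elements; f returns an iterable of outputs.
def rdd_map_partitions(partitions, f):
    return [list(f(iter(part))) for part in partitions]

def add_offset(it):
    offset = 100  # stand-in for expensive setup, done once per partition
    for x in it:
        yield x + offset

out = rdd_map_partitions([[1, 2], [3]], add_offset)
# → [[101, 102], [103]]
```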