RDD
An RDD is a resilient distributed dataset.
RDD[T] is a collection of values of type T. RDDs are divided into “partitions”, which workers operate on independently.
RDDs can be made in a variety of ways:
sc.parallelize - method on the Spark context; makes any traversable collection into an RDD
There are a variety of operations that can be performed on RDDs, similar to MapReduce but with more variety.
Map-Like Operations

Map-like operations are those that return another RDD.
map
Function $f$ takes one value of type T and outputs one value of type U
filter
Function $f$ takes one value of type T and returns a Boolean; values for which $f$ returns true are kept
flatMap
Function $f$ takes one value of type T and outputs an iterable collection of type U
mapPartitions
Function $f$ is given an iterator that produces values of type T, and returns an iterator/iterable collection that produces values of type U
Each worker calls $f$ once on an iterator that traverses all items in that worker’s partition of the input RDD
Each worker traverses the returned iterable, and each value is added to the output RDD
Useful for per-partition setup/cleanup, analogous to setup/cleanup in MapReduce

e.g. useful when some common computation is needed once per partition rather than once per element
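The idea above can be sketched in plain Python (an analogue of mapPartitions, not Spark code); the `map_partitions` and `add_offset` names, and the "expensive setup" value, are assumptions for illustration.

```python
# Plain-Python sketch of mapPartitions: f is called once per partition,
# receiving an iterator over that partition's items, so any setup inside
# f runs once per partition instead of once per element.
def map_partitions(partitions, f):
    return [list(f(iter(part))) for part in partitions]

def add_offset(it):
    offset = 100  # stand-in for an expensive one-time setup step
    for x in it:
        yield x + offset
    # cleanup code could go here, after the iterator is exhausted

result = map_partitions([[1, 2], [3]], add_offset)
```

Contrast with map: with two partitions, the setup line runs twice here, whereas a per-element map would redo it for every value.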