Spark is a fast and expressive cluster computing engine compatible with Apache Hadoop.

Programs are written in terms of operations on distributed datasets (RDDs).

image.png

Spark does lazy evaluation - nothing is done until an action is reached.

For fault recovery, RDDs track lineage information that can be used to efficiently recompute lost data.

image.png