Intro to Apache Spark

Spark is a fast and expressive cluster computing engine compatible with Apache Hadoop.

Efficient
- Up to 10x faster than Hadoop MapReduce on disk, up to 100x in memory
- General execution graphs and in-memory storage
Usable
- 2-5x less code than the equivalent Hadoop program
- Rich Java, Scala, Python APIs
- Interactive shell

Programs are written in terms of operations on distributed datasets, called RDDs (Resilient Distributed Datasets).
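
To make "operations on distributed datasets" concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the dataset is just a list of partitions, and the names `partitions`, `flat_map`, `count_by_value`, and `word_count` are hypothetical, chosen for illustration only.

```python
from collections import Counter
from functools import reduce

# A "distributed" dataset modelled as a list of partitions,
# one per hypothetical worker. Real Spark manages this for you.
partitions = [
    ["spark is fast", "spark is expressive"],
    ["hadoop is compatible"],
]

def flat_map(parts, fn):
    """Apply fn to every element of each partition and flatten the results."""
    return [[out for item in part for out in fn(item)] for part in parts]

def count_by_value(parts):
    """Count locally within each partition, then merge the partial counts."""
    partials = [Counter(part) for part in parts]
    return reduce(lambda a, b: a + b, partials, Counter())

words = flat_map(partitions, str.split)   # split lines into words, per partition
word_count = count_by_value(words)        # combine per-partition counts
```

The program describes what to do with the whole dataset, while the work itself happens partition by partition. That is the property that lets a cluster engine spread the same operations across machines.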

Spark evaluates lazily: transformations only describe the computation, and nothing runs until an action is reached.
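
Lazy evaluation can be sketched with a toy class in plain Python. This is an illustration of the idea under simplifying assumptions, not Spark's implementation: `LazyDataset`, `square`, and `calls` are hypothetical names, and the data lives in a single process.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: likewise only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

calls = []  # records every time square() actually runs

def square(x):
    calls.append(x)
    return x * x

pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # square() has not been called yet
result = pipeline.collect()       # the action triggers the computation
```

Building `pipeline` never calls `square`; the function runs only inside `collect()`. Deferring work this way gives an engine the chance to see the whole chain of steps before executing any of them.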

For fault recovery, RDDs track lineage information that can be used to efficiently recompute lost data.
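
The lineage idea can be sketched with a toy model in plain Python. It assumes a single process and simulates data loss by dropping a cached partition; `LineageDataset`, `lose_partition`, and the other names are hypothetical and do not correspond to Spark's API.

```python
class LineageDataset:
    """Toy dataset that remembers how each partition is derived from its parent."""

    def __init__(self, compute, parent=None, label="source"):
        self._compute = compute   # function: partition index -> list of values
        self.parent = parent      # the dataset this one was derived from
        self.label = label
        self._cache = {}          # materialized partitions
        self.computed = []        # log of partition indices actually computed

    def map(self, fn, label="map"):
        # The child stores only the recipe, not a copy of the data.
        return LineageDataset(
            lambda i: [fn(x) for x in self.partition(i)], parent=self, label=label
        )

    def partition(self, i):
        if i not in self._cache:
            self.computed.append(i)
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a worker failure that drops one materialized partition.
        self._cache.pop(i, None)

    def lineage(self):
        node, chain = self, []
        while node is not None:
            chain.append(node.label)
            node = node.parent
        return list(reversed(chain))

source_data = [[1, 2], [3, 4]]
source = LineageDataset(lambda i: source_data[i])
doubled = source.map(lambda x: x * 2, label="double")

first = doubled.partition(1)      # materialize partition 1
doubled.lose_partition(1)         # simulate losing it
recovered = doubled.partition(1)  # rebuilt from the recorded lineage
```

Only the lost partition is recomputed, by replaying the recorded steps from its parent; partition 0 is never touched. Keeping the recipe rather than replicas of the data is what makes this style of recovery cheap.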