DAG Scheduler
Under the hood, Spark builds a directed acyclic graph (DAG) of the dependencies between RDDs, and uses it to plan the tasks it runs.
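A minimal sketch of the idea, using a hypothetical `ToyRDD` class (not Spark's actual classes): transformations are lazy and only record their parent, so chained calls build a dependency graph that a scheduler can inspect and then execute in one pass per element.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage instead of computing eagerly."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # No work happens here; we only extend the dependency graph.
        return ToyRDD(parent=self, fn=fn)

    def lineage(self):
        # Walk parent pointers to list the recorded transformations in order.
        node, chain = self, []
        while node.parent is not None:
            chain.append(node.fn.__name__)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        # Apply the whole chain to each element in a single pass --
        # the essence of pipelining narrow transformations.
        fns, node = [], self
        while node.parent is not None:
            fns.append(node.fn)
            node = node.parent
        out = []
        for x in node.data:
            for fn in reversed(fns):
                x = fn(x)
            out.append(x)
        return out


def double(x):
    return x * 2


def inc(x):
    return x + 1


rdd = ToyRDD(data=[1, 2, 3]).map(double).map(inc)
# rdd.lineage() → ['double', 'inc']; rdd.collect() → [3, 5, 7]
```

Nothing runs until `collect()` is called, which is why the scheduler can see the whole graph before deciding how to group work.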

- Automatically pipelines functions
- Aware of data locality
- Partitioning aware (i.e. attempts to avoid shuffles)
- RDDs carry a flag indicating whether they are partitioned by key and, if so, which partitioning rule was used
- Two RDDs are co-partitioned when they are partitioned by the same rule into the same number of partitions, so each partition of one depends on only a single partition of the other
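A toy sketch of co-partitioning (function names are hypothetical, not Spark's API): when two datasets are hashed by the same rule into the same number of partitions, any given key lands at the same partition index in both, so a join can proceed partition-by-partition with no shuffle.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


# Both datasets use the same rule and partition count: co-partitioned.
a = hash_partition([("x", 1), ("y", 2), ("z", 3)], 4)
b = hash_partition([("x", 10), ("y", 20)], 4)

# Partition-local join: only corresponding partitions need to be compared,
# because matching keys are guaranteed to share a partition index.
joined = []
for pa, pb in zip(a, b):
    lookup = dict(pb)
    for key, value in pa:
        if key in lookup:
            joined.append((key, (value, lookup[key])))
# joined contains ("x", (1, 10)) and ("y", (2, 20))
```

If the two datasets had different partition counts or rules, matching keys could land in different partitions, and a shuffle would be needed before joining.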
Physical Operators

- Narrow dependency - each partition of the child RDD depends on only a small, fixed set of parent partitions; much faster than a wide dependency, as no shuffling of data between worker nodes is required
- reduceByKey, groupByKey, etc. will have narrow dependencies if the upstream RDD is already partitioned by key
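A hedged pure-Python sketch (hypothetical helper names, not Spark's implementation) of why that is: if the input is already hash-partitioned by key, every occurrence of a key sits in one partition, so a reduceByKey-style aggregation can run entirely within each partition with no data movement.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def reduce_by_key_local(partitions, fn):
    # Narrow dependency: each output partition reads exactly one input
    # partition, because all copies of a key already live together.
    out = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = fn(acc[key], value) if key in acc else value
        out.append(acc)
    return out


parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 3)
reduced = reduce_by_key_local(parts, lambda x, y: x + y)
# Merging all output partitions gives {"a": 4, "b": 2}
```

On unpartitioned input, the same operation would first need a shuffle to bring each key's values onto one node, turning it into a wide dependency.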