DAG Scheduler
Under the hood, Spark builds a directed acyclic graph (DAG) of the dependencies between RDDs, and uses it to plan the tasks it runs.
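A minimal sketch of the idea, using a hypothetical `ToyRDD` class (not Spark's actual classes): transformations are lazy and only record their parent, so chained calls build a dependency graph that a scheduler can inspect and then execute in one pass per element.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage instead of computing eagerly."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, fn):
        # No work happens here; we only extend the dependency graph.
        return ToyRDD(parent=self, fn=fn)

    def lineage(self):
        # Walk parent pointers to list the recorded transformations in order.
        node, chain = self, []
        while node.parent is not None:
            chain.append(node.fn.__name__)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        # Apply the whole chain to each element in a single pass --
        # the essence of pipelining narrow transformations.
        fns, node = [], self
        while node.parent is not None:
            fns.append(node.fn)
            node = node.parent
        out = []
        for x in node.data:
            for fn in reversed(fns):
                x = fn(x)
            out.append(x)
        return out


def double(x):
    return x * 2


def inc(x):
    return x + 1


rdd = ToyRDD(data=[1, 2, 3]).map(double).map(inc)
# rdd.lineage() → ['double', 'inc']; rdd.collect() → [3, 5, 7]
```

Nothing runs until `collect()` is called, which is why the scheduler can see the whole graph before deciding how to group work.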

- Automatically pipelines functions
- Aware of data locality
- Partitioning aware (i.e. attempts to avoid shuffles)
- RDDs carry a flag indicating whether they are partitioned by key and, if so, which partitioning rule was used
- Two RDDs are co-partitioned when they are partitioned by the same rule into the same number of partitions, so each partition of one depends on only a single partition of the other
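A toy sketch of co-partitioning (function names are hypothetical, not Spark's API): when two datasets are hashed by the same rule into the same number of partitions, any given key lands at the same partition index in both, so a join can proceed partition-by-partition with no shuffle.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


# Both datasets use the same rule and partition count: co-partitioned.
a = hash_partition([("x", 1), ("y", 2), ("z", 3)], 4)
b = hash_partition([("x", 10), ("y", 20)], 4)

# Partition-local join: only corresponding partitions need to be compared,
# because matching keys are guaranteed to share a partition index.
joined = []
for pa, pb in zip(a, b):
    lookup = dict(pb)
    for key, value in pa:
        if key in lookup:
            joined.append((key, (value, lookup[key])))
# joined contains ("x", (1, 10)) and ("y", (2, 20))
```

If the two datasets had different partition counts or rules, matching keys could land in different partitions, and a shuffle would be needed before joining.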
Physical Operators

- Narrow dependency - each partition of the child RDD depends on only a small, fixed set of parent partitions; much faster than a wide dependency, as no shuffling of data between worker nodes is required
- reduceByKey, groupByKey, etc. will have narrow dependencies if the upstream RDD is already partitioned by key
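A hedged pure-Python sketch (hypothetical helper names, not Spark's implementation) of why that is: if the input is already hash-partitioned by key, every occurrence of a key sits in one partition, so a reduceByKey-style aggregation can run entirely within each partition with no data movement.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to a partition by hashing the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def reduce_by_key_local(partitions, fn):
    # Narrow dependency: each output partition reads exactly one input
    # partition, because all copies of a key already live together.
    out = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = fn(acc[key], value) if key in acc else value
        out.append(acc)
    return out


parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 3)
reduced = reduce_by_key_local(parts, lambda x, y: x + y)
# Merging all output partitions gives {"a": 4, "b": 2}
```

On unpartitioned input, the same operation would first need a shuffle to bring each key's values onto one node, turning it into a wide dependency.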