YARN

YARN (Yet Another Resource Negotiator) is Hadoop's resource manager. It provides an API for developing generic distributed applications.

Spark Architecture

In Spark, the process that executes tasks is called an “executor” rather than a worker.

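Usually, multiple tasks are sent to each executor. The Spark driver must send the relevant code, along with any variables that code references, for each task. This can be wasteful: a large value referenced by the code is shipped again with every task, even when many of those tasks run on the same executor.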

Broadcast

If a value is broadcast, Spark sends only one copy of the value per executor, not one per task.

thresh = sc.broadcast(5)  # wrap the value; one copy per executor
filtered = myRdd.filter(lambda x: x > thresh.value)  # tasks read it through .value
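
To get a feel for why this matters, here is a rough back-of-the-envelope sketch in plain Python (no Spark required). It is a toy model only: the function `bytes_shipped`, the executor and task counts, and the `lookup` list are all hypothetical, and it ignores how Spark actually serializes and distributes data. It simply counts one copy per task without broadcast versus one copy per executor with broadcast.

```python
import pickle


def bytes_shipped(value, num_executors, tasks_per_executor, broadcast):
    """Toy estimate of the bytes sent for one shared value (not Spark's real accounting)."""
    payload = len(pickle.dumps(value))  # serialized size of a single copy
    if broadcast:
        copies = num_executors  # one copy per executor
    else:
        copies = num_executors * tasks_per_executor  # one copy per task
    return payload * copies


# Hypothetical large value that every task needs to read.
lookup = list(range(100_000))

per_task = bytes_shipped(lookup, num_executors=4, tasks_per_executor=50, broadcast=False)
per_executor = bytes_shipped(lookup, num_executors=4, tasks_per_executor=50, broadcast=True)

# In this model the saving factor is exactly tasks_per_executor (50 here).
print(per_task // per_executor)
```

For a small value like the threshold 5 above the saving is negligible; in this model the benefit grows with the size of the value and the number of tasks each executor runs.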