3.1 Building models. Considerations include:
Choosing ML framework and model architecture
Modeling techniques given interpretability requirements
3.2 Training models. Considerations include:
Using distributed training to organize reliable pipelines
Data parallelism
Sync training
Async training
All-reduce sync training on TPUs (a synchronous data-parallel sketch follows this list)
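For example, synchronous data parallelism on GPUs can be expressed with tf.distribute.MirroredStrategy, which keeps one full model replica per device and combines gradients with an all-reduce every step (the TPU counterpart is tf.distribute.TPUStrategy, and asynchronous training is typically handled by a ParameterServerStrategy). This is only a minimal sketch; the layer sizes and random data are placeholders to keep it self-contained.

```python
import numpy as np
import tensorflow as tf

# Synchronous data parallelism: each GPU holds a full model replica and
# processes a different slice of every batch; gradients are all-reduced
# before the shared weights are updated. Falls back to CPU if no GPU is
# visible.
strategy = tf.distribute.MirroredStrategy()

# Anything that creates variables (model, optimizer) goes inside the scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder in-memory data, only to make the sketch runnable end to end.
x = np.random.rand(1024, 20).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=64, epochs=1)
```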
Model parallelism: the model itself is partitioned into parts (rather than the data, as in data parallelism), and each part is placed on its own GPU. Model parallelism overcomes the memory bottleneck of training a large model on a single GPU by splitting the model's layers across multiple GPUs.
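A minimal illustration of the idea, assuming two visible devices named "/GPU:0" and "/GPU:1": the layers of a Keras model are built under explicit tf.device scopes so each half of the network lives on a different GPU. In practice, model-parallel training usually relies on dedicated pipeline- or tensor-parallel libraries rather than manual placement like this.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))

# First half of the layers (and their weights) are placed on the first GPU.
with tf.device("/GPU:0"):
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dense(512, activation="relu")(x)

# Remaining layers go on the second GPU, so no single device has to hold
# all of the model's parameters.
with tf.device("/GPU:1"):
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```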
Cloud TPUs - tf.distribute.Strategy is the TensorFlow API for distributing training across multiple GPUs, multiple machines, or TPUs.
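A sketch of the TPU path, assuming the code runs in an environment with a Cloud TPU attached (for example a Cloud TPU VM or a TPU-enabled notebook); the empty tpu="" argument and the tiny model are placeholders.

```python
import tensorflow as tf

# Locate and initialize the TPU system, then build a TPUStrategy so that
# variables and training steps are replicated across the TPU cores.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Model and optimizer are created inside strategy.scope(); fit() then runs
# synchronous all-reduce training on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
```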
