By Chaminda Bandara (wgcban.com)
Setting up Distributed Data Parallel
Based on the following resources:
Video series: https://youtu.be/-K3bZYHYHEA?si=6oJA65LybhwDYotp
Github: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
PyTorch Tutorial: https://pytorch.org/tutorials/beginner/ddp_series_theory.html
Informative lecture: https://youtu.be/TibQO_xv1zc?si=3daCXI9m5bhSC5X1
import torch.nn as nn

model = MyModel()               # MyModel: any nn.Module you define
model = nn.DataParallel(model)  # replicates the model across visible GPUs and splits each batch
Identical copies of the model, different sub-batches, synchronized updates
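The synchronization step can be sketched in plain Python (an illustrative sketch, no PyTorch; `allreduce_mean` is a made-up name standing in for the all-reduce that averages gradients across replicas):

```python
# Conceptual sketch of data-parallel gradient synchronization:
# each replica computes gradients on its own sub-batch, then the
# element-wise mean is taken across replicas before the update.
def allreduce_mean(grads_per_replica):
    """Average corresponding gradient entries across all replicas."""
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]

g1 = [0.5, -1.0, 1.0]   # gradients from replica 1 (sub-batch 1)
g2 = [1.5,  0.0, 0.5]   # gradients from replica 2 (sub-batch 2)
print(allreduce_mean([g1, g2]))  # [1.0, -0.5, 0.75]
```

After this step every replica applies the same averaged gradient, so the model copies stay identical.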
The per-replica losses l_1 and l_2 are computed on different sub-batches, so their gradients differ. Hence the updates need to be synchronized, which requires some communication between replicas.

| DataParallel | DistributedDataParallel |
|---|---|
| More overhead; the model is replicated and destroyed at each forward pass | The model is replicated only once |
| Supports only single-node parallelism | Supports scaling across multiple machines |
| Slower; uses multithreading in a single process and runs into Global Interpreter Lock (GIL) contention | Faster (no GIL contention) because it uses multiprocessing |
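Putting the table's points into practice, here is a minimal DDP training sketch (my assumptions: one process per replica spawned with `mp.spawn`, CPU-only `gloo` backend, and a placeholder `nn.Linear` model with random data; swap in `nccl` and your own model/dataset for real GPU training):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def run(rank: int, world_size: int) -> None:
    # Each process owns one model replica; processes rendezvous via env vars.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)  # "nccl" on GPUs

    model = nn.Linear(10, 1)          # placeholder model
    ddp_model = DDP(model)            # replicated once, not per forward pass
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    # DistributedSampler hands each rank a disjoint sub-batch of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)      # reshuffle differently each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()           # gradients are all-reduced (averaged) here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```

Because each replica is a separate process, there is no GIL contention, and the same script scales to multiple machines by pointing `MASTER_ADDR` at a shared host and adjusting ranks.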