PyTorch Multi-GPU Training

Single Node Single GPU: one machine with one GPU
Single Node Multi GPU: one machine with several GPUs
Multi Node Multi GPU: several machines with several GPUs

Two ways to distribute training across multiple GPUs

Model Parallel

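An approach that has been in use for a long time, for example in AlexNet.

It is difficult in practice because of bottlenecks inside the model and the difficulty of pipelining the stages.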


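import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000  # assumed value; the original snippet does not define num_classes

class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super().__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        self.seq1 = nn.Sequential(
            self.conv1, self.bn1, self.relu, self.maxpool, self.layer1, self.layer2
        ).to('cuda:0')  # place the first half of the model on cuda:0

        self.seq2 = nn.Sequential(
            self.layer3, self.layer4, self.avgpool,
        ).to('cuda:1')  # place the second half of the model on cuda:1

        self.fc.to('cuda:1')  # the classifier also lives on cuda:1

    def forward(self, x):
        # Move the intermediate activations from cuda:0 to cuda:1 to connect the two halves.
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))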

Data Parallel

Split the data across the GPUs and average the results.

A single mini-batch is processed on several GPUs at the same time.


Forward

Scatter mini-batch inputs to GPUs
Replicate model on GPUs
Parallel forward passes
Gather outputs on GPU-1

Backward

Compute loss gradients on GPU-1
Scatter gradients to GPUs
Parallel backward passes
Reduce gradients to GPU-1
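
The forward and backward steps above can be imitated without any GPU. The following is a minimal sketch in plain Python, not PyTorch's implementation: it assumes a one-parameter linear model `y = w * x` with a mean-squared-error loss, and the names `scatter`, `shard_gradient`, and `data_parallel_gradient` are hypothetical helpers introduced only for illustration. It suggests that, when every shard has the same size, averaging the per-shard gradients gives the same gradient as the whole mini-batch.

```python
def scatter(batch, n_devices):
    """Split a mini-batch into n_devices contiguous, equally sized shards."""
    shard_size = len(batch) // n_devices
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_devices)]


def shard_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_gradient(w, batch, n_devices):
    """Scatter the batch, compute one gradient per replica, then average them."""
    shards = scatter(batch, n_devices)
    grads = [shard_gradient(w, shard) for shard in shards]  # one per simulated GPU
    return sum(grads) / len(grads)  # reduce step


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # (x, y) pairs with y = 2x
w = 0.5
full = shard_gradient(w, batch)  # gradient over the whole mini-batch
parallel = data_parallel_gradient(w, batch, n_devices=2)
print(full, parallel)
```

The equality relies on equal shard sizes; with uneven shards a plain average of shard gradients would weight samples differently.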

Approaches provided by PyTorch

DataParallel

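It simply distributes the data and then averages the results.

GPU usage becomes unbalanced.
This forces a smaller batch size (one GPU becomes the bottleneck), and the Python GIL is a further limitation.

It is relatively easy to use: you only need to wrap the model in nn.DataParallel.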

import torch

parallel_model = torch.nn.DataParallel(model)  # Encapsulate the model

predictions = parallel_model(inputs)  # Forward pass on multi-GPUs
loss = loss_function(predictions, labels)  # Compute loss function
loss.mean().backward()  # Average GPU-losses + backward pass
optimizer.step()  # Optimizer step
predictions = parallel_model(inputs)  # Forward pass with new parameters

DistributedDataParallel

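There is no gathering step like the one described above; each process runs its own computation and then the averaged gradients are applied.

This is possible because a CPU process is assigned to each GPU, not just the GPU itself.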
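
A rough way to see why no gather step is needed: if every process starts from the same weights and applies the same averaged gradient, the replicas never diverge. The sketch below simulates this in plain Python with an assumed one-parameter model `y = w * x`; `local_gradient` and `all_reduce_mean` are hypothetical stand-ins for the backward pass and the collective communication, not PyTorch APIs.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one process's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every process receives the same mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]


# Each simulated process owns its own shard and its own copy of the weight.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0 for _ in shards]  # identical initial weights on every process
lr = 0.01  # assumed learning rate

for step in range(200):
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(grads)  # only gradients are exchanged, never outputs
    weights = [w - lr * g for w, g in zip(weights, synced)]

print(weights)
```

Every replica ends with the same weight, close to the true slope of 2, although no process ever saw another process's data or outputs.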

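A sampler has to be created.
Set shuffle = False and pin_memory = True.

pin_memory: keeps batches in page-locked (pinned) host memory, which simplifies the path for copying data to GPU memory.
num_workers: set this based on the number of GPUs you have.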

import torch

train_sampler = torch.utils.data.distributed.DistributedSampler(train_data)
shuffle = False  # the sampler decides the order, so the DataLoader must not shuffle
pin_memory = True
trainloader = torch.utils.data.DataLoader(train_data, batch_size=20,
                                          pin_memory=pin_memory, num_workers=3,
                                          shuffle=shuffle, sampler=train_sampler)
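
To make the role of the sampler concrete, here is a simplified sketch of how a distributed sampler can hand each process its own slice of the dataset. It is an illustration in plain Python, not the source of `DistributedSampler`: the function `partition_indices` and its parameters are hypothetical, and the padding and strided slicing are assumptions about the general scheme.

```python
import math
import random


def partition_indices(dataset_size, world_size, rank, epoch=0, shuffle=True):
    """Return the dataset indices that the process with this rank would visit."""
    indices = list(range(dataset_size))
    if shuffle:
        # Seeding with the epoch gives every rank the same permutation.
        random.Random(epoch).shuffle(indices)
    per_rank = math.ceil(dataset_size / world_size)
    total = per_rank * world_size
    indices += indices[: total - dataset_size]  # pad so every rank gets equal work
    return indices[rank:total:world_size]  # strided slice for this rank


parts = [partition_indices(10, world_size=4, rank=r) for r in range(4)]
for rank, part in enumerate(parts):
    print(rank, part)
```

Because the shuffling happens inside the sampler, the DataLoader itself is created with shuffle=False; passing both a sampler and shuffle=True is rejected by PyTorch. Calling `train_sampler.set_epoch(epoch)` at the start of each epoch plays the role of the `epoch` argument here.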

import torch

def main():
    n_gpus = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=n_gpus, args=(n_gpus,))

def main_worker(gpu, n_gpus):
    image_size = 224
    batch_size = 512
    num_worker = 8
    epochs = ...
    # Divide the batch size and the number of workers by the number of GPUs.
    batch_size = int(batch_size / n_gpus)
    num_worker = int(num_worker / n_gpus)
    # Define how the processes communicate with each other.
    torch.distributed.init_process_group(backend='nccl',
                                         init_method='tcp://127.0.0.1:2568',
                                         world_size=n_gpus,
                                         rank=gpu)
    model = MODEL
    torch.cuda.set_device(gpu)
    model = model.cuda(gpu)
    # Wrap the model in DistributedDataParallel.
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

from multiprocessing import Pool
# Python multiprocessing example
def f(x):
    return x * x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))