PyTorch Troubleshooting

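OOM (Out Of Memory)

It is hard to tell why it happened.

It is hard to tell where it happened.

The error traceback often points to an unrelated place.

It is hard to know what state the memory was in before the error.

Lower the batch size → clean the GPU → run again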
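The "lower the batch size, clean the GPU, run again" loop can be sketched as a small retry helper. This is only a minimal sketch: run_with_smaller_batches, step, cleanup and fake_step are hypothetical names made up for the illustration, and the helper assumes the out-of-memory error arrives as a RuntimeError whose message contains "out of memory", which is how PyTorch reports CUDA OOM.

```python
def run_with_smaller_batches(step, batch_size, cleanup=None, min_batch_size=1):
    """Call step(batch_size); on an out-of-memory error, clean up, halve, retry.

    Returns (batch_size_that_worked, result_of_step).
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            # Anything that is not an OOM error, or an OOM at the smallest
            # allowed batch size, is passed on to the caller unchanged.
            if "out of memory" not in str(err).lower() or batch_size <= min_batch_size:
                raise
        # Clean up outside the except block, so the exception (and the
        # stack frames it references) is no longer holding on to tensors.
        if cleanup is not None:
            cleanup()
        batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for a real training step: pretends anything above 16 does not fit.
def fake_step(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory. (simulated)")
    return f"trained with batch size {batch_size}"


used, message = run_with_smaller_batches(fake_step, 64)
print(used, message)
```

In a real training script the cleanup argument would typically be torch.cuda.empty_cache, and step would run one forward and backward pass.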

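Using GPUtil

A module that shows the state of the GPU, much like nvidia-smi.

It is a convenient way to display GPU status in a Colab environment.

Use it to check whether memory usage grows with every iteration.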

!pip install GPUtil

import GPUtil
GPUtil.showUtilization()
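To check whether memory grows with every iteration, one option is to record a number per step instead of reading a printed table. The sketch below uses torch.cuda.memory_allocated() rather than GPUtil, so it only sees memory held by PyTorch tensors in this process. gpu_memory_mb and grows_every_iteration are hypothetical helper names, and the tiny Linear model is just a placeholder for a real training loop.

```python
import torch


def gpu_memory_mb():
    """MiB currently occupied by tensors on the default CUDA device (0.0 without a GPU)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 1024**2


def grows_every_iteration(samples):
    """True if every sample is strictly larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 1).to(device)

history = []
for step in range(5):
    x = torch.randn(32, 128, device=device)
    loss = model(x).mean()
    loss.backward()
    model.zero_grad()
    history.append(gpu_memory_mb())
    print(f"iter {step}: {history[-1]:.1f} MiB allocated")

if grows_every_iteration(history):
    print("memory grew on every iteration; look for tensors kept alive across steps")
```

A steady climb over many iterations is the usual sign that something inside the loop is being kept alive.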

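Using torch.cuda.empty_cache()

It releases cached GPU memory that is no longer in use.

This frees up available memory.

It is not the same as del. del only removes the reference to the object; the memory itself stays in PyTorch's cache.

It is a handy function to use instead of a full reset.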

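import torch
from GPUtil import showUtilization as gpu_usage
print("Initial GPU Usage")
gpu_usage()
# Allocate ten large tensors on the GPU (about 400 MB each as float32).
tensor_list = []
for x in range(10):
    tensor_list.append(torch.randn(10000000, 10).cuda())
print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()
# del only drops the reference; the freed blocks stay in PyTorch's cache.
del tensor_list
print("GPU Usage after deleting the Tensors")
gpu_usage()
# empty_cache() hands the unused cached blocks back to the GPU.
print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()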
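The same difference between del and empty_cache() can also be seen from inside PyTorch, without GPUtil. The sketch below compares torch.cuda.memory_allocated() (memory occupied by live tensors) with torch.cuda.memory_reserved() (memory held by PyTorch's caching allocator). memory_mb is a hypothetical helper name, and the tensor size is arbitrary.

```python
import torch


def memory_mb():
    """Return (allocated, reserved) in MiB for the current CUDA device."""
    return (torch.cuda.memory_allocated() / 1024**2,
            torch.cuda.memory_reserved() / 1024**2)


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, 64, device="cuda")  # 256 MiB of float32
    print("after allocation :", memory_mb())
    del x  # allocated drops, reserved stays: the block is cached
    print("after del        :", memory_mb())
    torch.cuda.empty_cache()  # reserved drops as well
    print("after empty_cache:", memory_mb())
else:
    print("no CUDA device found; nothing to measure")
```

Tools such as nvidia-smi and GPUtil report roughly the reserved figure, which is why memory still looks occupied right after del.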

A variable held as a tensor on a CUDA device occupies GPU memory.

If it also requires gradients (requires_grad=True), it additionally needs memory buffers for those gradients, so it consumes far more memory.
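A small sketch of the extra gradient buffer: after backward(), a tensor with requires_grad=True carries a .grad tensor of the same shape, which is roughly a second copy in memory. It runs on the CPU so it works without a GPU; on a CUDA device the same bookkeeping applies. tensor_bytes is a hypothetical helper name.

```python
import torch


def tensor_bytes(t):
    """Size of the tensor's data in bytes."""
    return t.element_size() * t.nelement()


w = torch.randn(1000, 1000, requires_grad=True)
print("grad before backward:", w.grad)  # no gradient buffer exists yet

loss = (w * 2).sum()
loss.backward()

print("weights :", tensor_bytes(w), "bytes")
print("gradient:", tensor_bytes(w.grad), "bytes")
```

On top of this, intermediate results needed for the backward pass are kept alive until backward() has run, which adds to the total.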

If a tensor that requires gradients is accumulated inside a loop, the computational graph of every iteration is kept alive on the GPU and gradually eats up memory.
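A common case of this is summing the loss across iterations for logging. The sketch below contrasts accumulating the loss tensor itself with accumulating a plain Python number via .item(). The model and the variable names are made up for the illustration, and it runs on the CPU so it can be tried without a GPU.

```python
import torch

model = torch.nn.Linear(10, 1)

total_with_graph = 0
total_as_float = 0.0

for _ in range(3):
    loss = model(torch.randn(4, 10)).pow(2).mean()

    # Adding the tensor keeps every iteration's graph reachable
    # through the running total, so none of it can be freed.
    total_with_graph = total_with_graph + loss

    # .item() copies the value out as a Python float; the graph for
    # this iteration can be freed once loss goes out of scope.
    total_as_float += loss.item()

print(total_with_graph.requires_grad)  # True
print(type(total_as_float))
```

Calling loss.detach() instead of loss.item() also cuts the link to the graph when the value should stay a tensor.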