May 18 Sue Hyun Park

<aside> 🔗 We have reposted this blog on our Medium publication. Read this on Medium.

</aside>

Using GPUs for deep learning (DL) is standard practice, as GPUs can perform large amounts of computation in parallel. Recent DL frameworks such as TensorFlow, PyTorch, and MXNet run models on GPUs to improve DL inference and training speed. These frameworks automatically launch DL operators on GPUs, freeing users from manually handling GPU intricacies.

<aside> 💡 A DL operator represents a numerical computation, such as convolution or batch normalization, and consists of one or more GPU tasks: GPU kernels and GPU memory operations (e.g., memory copies).

</aside>
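
To make the operator-to-GPU-task mapping concrete, here is a minimal sketch (ours, not from the Nimble work) that uses the PyTorch profiler to list the GPU tasks a single convolution operator triggers. It assumes PyTorch with a CUDA-capable GPU is available.

```python
# Minimal sketch, assuming PyTorch with a CUDA-capable GPU: a single DL
# operator (conv2d here) is backed by one or more GPU tasks, i.e. GPU
# kernels plus memory operations such as the host-to-device copy below.
import torch
from torch.profiler import profile, ProfilerActivity

conv = torch.nn.Conv2d(3, 64, kernel_size=3).cuda()
x = torch.randn(1, 3, 224, 224)            # input tensor, still on the CPU

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = conv(x.cuda())                      # memory copy + convolution kernel(s)
    torch.cuda.synchronize()                # wait until the GPU tasks finish

# The table lists the CUDA kernels and memcpy operations behind the operator.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```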

To ultimately submit tasks to the GPU, a DL framework first represents a neural network as a computation graph of DL operators. It then goes through a series of preparation steps, in which it selects the next operator to run and dispatches the proper GPU kernel(s) based on the shapes of the input tensors. We call this series of steps GPU task scheduling.
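
As a rough illustration of what these preparation steps look like, the toy sketch below walks a small computation graph in dependency order, picks a kernel based on the input shape, and "launches" it. The names (`Operator`, `select_kernel`, `schedule_and_run`) are hypothetical and do not correspond to any real framework API.

```python
# Toy sketch of GPU task scheduling; names here are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    deps: list = field(default_factory=list)   # operators that must run first

def topological_order(graph):
    """Return operators in an order that respects their dependencies."""
    done, order = set(), []
    def visit(op):
        if op.name in done:
            return
        for dep in op.deps:
            visit(dep)
        done.add(op.name)
        order.append(op)
    for op in graph:
        visit(op)
    return order

def select_kernel(op, input_shape):
    # A real framework consults kernel registries, cuDNN heuristics, etc.
    return f"{op.name}_kernel_{'x'.join(map(str, input_shape))}"

def schedule_and_run(graph, input_shape):
    for op in topological_order(graph):          # (1) select the next operator
        kernel = select_kernel(op, input_shape)  # (2) pick kernel(s) for the tensor shape
        print(f"launch {kernel}")                # (3) submit the GPU task

conv = Operator("conv")
bn = Operator("batchnorm", deps=[conv])
relu = Operator("relu", deps=[bn])
schedule_and_run([conv, bn, relu], (1, 64, 56, 56))
```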

How DL frameworks use GPUs

Existing DL frameworks conduct GPU task scheduling at run time, which may significantly limit framework performance: the execution of a neural network can take longer than the amount of computation assigned to the GPU actually requires.

As a result, current DL frameworks often run into suboptimal situations where they cannot fully utilize the computation power of GPUs.

This post introduces Nimble, a DL execution engine that addresses these inefficiencies of existing DL frameworks. First, we point out two important problems in run-time GPU task scheduling. Next, we describe two novel techniques the system employs to improve framework performance.

Why Run-Time GPU Task Scheduling in Existing DL Frameworks Is a Problem

1. High Scheduling Overhead Makes GPUs Idle

We experimentally observe that GPU idle time dominates the overall running time of DL execution, as the left graph below shows. Both TensorFlow and PyTorch leave their GPUs idle for a substantial portion of the running time, up to 71% and 91%, respectively.

Is the performance bottleneck located in the scheduling procedure of the framework? To find out, we write a C++ program that performs the inference of one specific neural network only; it uses the same GPU tasks as PyTorch but removes part of the scheduling overhead by hardcoding the schedule. This lightweight version achieves up to 2.37 times speedup over PyTorch on inference, as the right graph shows. The result confirms that the main source of GPU idle time is the prohibitive overhead of the scheduling procedure.

Ratio of GPU idle time to the overall running time on DL inference (batch size: 1)

Inference latencies of PyTorch and its scheduling-minimized version
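
If you want to get a feel for this gap yourself, a rough way (not the authors' C++ harness or exact methodology) is to compare the end-to-end latency of one inference pass against the GPU-busy time reported by the PyTorch profiler; the difference approximates GPU idle time. The sketch below assumes torchvision and a CUDA device, and the profiler itself adds overhead, so treat the numbers as indicative only.

```python
# Rough sketch, not the measurement code from the post: approximate GPU idle
# time by comparing wall-clock latency with the summed GPU kernel time that
# the PyTorch profiler reports. Assumes torchvision and a CUDA device.
import time
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet50().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")   # batch size 1, as in the graphs

with torch.no_grad():
    for _ in range(10):                           # warm-up iterations
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        model(x)
        torch.cuda.synchronize()
    wall_s = time.perf_counter() - start

# self_cuda_time_total is reported in microseconds.
gpu_busy_s = sum(e.self_cuda_time_total for e in prof.key_averages()) / 1e6
print(f"wall-clock: {wall_s * 1e3:.2f} ms | GPU busy: {gpu_busy_s * 1e3:.2f} ms | "
      f"idle ratio: {1 - gpu_busy_s / wall_s:.0%}")
```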

2. Non-Parallel GPU Task Execution

GPUs have a powerful capability to run thousands of threads in parallel. There are two ways to fully utilize this power:

  1. Make the most use of GPU cores in a single GPU stream: make every GPU kernel maintain a sufficient level of intra-kernel parallelism so that each kernel can take full advantage of the GPU's computation power on its own.