Seyeon An July 31, 2021


Battle of Deep Learning Frameworks

Widely Used Deep Learning Optimization Frameworks (as of now)

If you are a deep learning engineer, or more generally, if you work in the field of artificial intelligence, you have almost certainly used at least one of these deep learning frameworks: TensorRT by NVIDIA, PyTorch by Facebook, or TensorFlow by Google. Deep learning technology is essential everywhere, from artificial intelligence to big data. That is why these deep learning frameworks, which allow engineers to build optimized deep learning models more easily and quickly without getting into the details of the underlying algorithms, are so important. This is also why Big Tech companies are investing in building the best deep learning framework: companies like Google and Facebook have put enormous effort into developing state-of-the-art deep learning software.

DeepCuts

Then, which software has won this battle of deep learning frameworks? The question is hard to answer, since even the most popular and most extensively used DL frameworks have their own problems. Yet DeepCuts, developed by a group of researchers from the Thunder Research Group at Seoul National University, is definitely one worth noting.

The Problem with Existing Deep Learning Frameworks

The Performance-Flexibility Trade-off Dilemma that DeepCuts Aims to Solve

GPUs are the de facto standard for running DL applications. Almost every widely used DL framework, such as TensorFlow, PyTorch, and MXNet, supports GPU acceleration via cuDNN, provided by NVIDIA. cuDNN is the state-of-the-art DL primitive library: it accelerates DL computations by providing highly tuned implementations of DL primitives, the smallest units of DL processing.
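
To make this concrete, here is a minimal sketch (assuming PyTorch built with CUDA support and an NVIDIA GPU) of how a single framework-level operation is handed off to a cuDNN primitive:

```python
# Minimal sketch: a high-level convolution call dispatched to a cuDNN primitive.
# Assumes PyTorch with a CUDA build and an NVIDIA GPU.
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 224, device="cuda")  # input tensor (NCHW)
w = torch.randn(64, 3, 3, 3, device="cuda")     # convolution filters

# Let cuDNN benchmark and pick its fastest convolution algorithm for this shape.
torch.backends.cudnn.benchmark = True

y = F.conv2d(x, w, padding=1)  # executed by a cuDNN convolution primitive
print(y.shape)                 # torch.Size([1, 64, 224, 224])
```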

However, using a primitive library such as cuDNN does not guarantee the best performance. Such libraries show poor performance as the operations of the deep learning network (e.g., convolutions) and the underlying hardware become more diverse. Moreover, their support for kernel fusion (a well-known optimization that reduces GPU global memory accesses between consecutively executed kernels by merging them into a single kernel) is limited. cuDNN supports kernel fusion only for a few DL workload patterns, such as a sequence of a convolution, a bias addition, and a ReLU activation. This is not sufficient to handle the various DL operation patterns found in emerging DL workloads.
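
The idea behind kernel fusion can be illustrated with a toy example. The sketch below uses plain NumPy functions as stand-ins for GPU kernels, so the "kernels" and array names are purely illustrative:

```python
# Toy illustration of kernel fusion (NumPy stands in for GPU kernels).
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)
b = np.float32(0.5)

# Unfused: two "kernels", with an intermediate result materialized in between.
def bias_add_kernel(x, b):
    return x + b             # kernel 1: produces an intermediate array

def relu_kernel(t):
    return np.maximum(t, 0)  # kernel 2: reads the intermediate array back

y_unfused = relu_kernel(bias_add_kernel(x, b))

# Fused: both steps grouped into one "kernel".
def fused_bias_relu_kernel(x, b):
    # On a GPU this whole body would be compiled into a single kernel, so the
    # bias-added intermediate would stay in registers instead of global memory.
    return np.maximum(x + b, 0)

y_fused = fused_bias_relu_kernel(x, b)
assert np.allclose(y_unfused, y_fused)
```

On a real GPU, the fused version avoids one round trip of the intermediate tensor through global memory, which is exactly the saving kernel fusion is after.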

There have also been DL optimization frameworks that do not rely on hand-tuned GPU kernels, but their performance has been relatively poor. In other words, previous DL optimization frameworks, or DL compilers, have faced a trade-off dilemma between speed (performance) and flexibility:

Then, How about DeepCuts?

DeepCuts, which is a DL optimization framework like TensorFlow XLA, considers both kernel implementation parameters and GPU architecture parameters to generate optimized code. In DeepCuts, a flexible code generator supports various types of DL operations, and a kernel optimizer uses architectural information about the target GPU. As a result, DeepCuts achieves higher performance than existing state-of-the-art DL optimization frameworks (Apache TVM, Google TensorFlow XLA, NVIDIA TensorRT).
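
The sketch below is not DeepCuts' actual algorithm; it only illustrates, with made-up parameter names and example values, what it means to combine kernel implementation parameters (tile sizes, threads per block) with GPU architecture parameters (shared memory capacity, thread limits) when narrowing down candidate kernel configurations:

```python
# Illustrative sketch only (not DeepCuts' actual code): using GPU architecture
# parameters to filter out infeasible kernel implementation parameters.
from itertools import product

# GPU architecture parameters (example values roughly in line with an NVIDIA V100).
GPU_ARCH = {
    "max_threads_per_block": 1024,
    "shared_mem_per_block": 48 * 1024,  # bytes
}

# Candidate kernel implementation parameters for a tiled convolution/GEMM-style kernel.
TILE_M = [32, 64, 128]
TILE_N = [32, 64, 128]
THREADS = [128, 256, 512, 1024]

def is_feasible(tile_m, tile_n, threads, arch, bytes_per_elem=4):
    """Reject configurations that exceed the GPU's hardware limits."""
    shared_mem_needed = tile_m * tile_n * bytes_per_elem
    return (threads <= arch["max_threads_per_block"]
            and shared_mem_needed <= arch["shared_mem_per_block"])

# Enumerate the search space and keep only feasible candidates;
# a real optimizer would then rank these by a performance estimate.
candidates = [(m, n, t) for m, n, t in product(TILE_M, TILE_N, THREADS)
              if is_feasible(m, n, t, GPU_ARCH)]
print(f"{len(candidates)} feasible configurations out of "
      f"{len(TILE_M) * len(TILE_N) * len(THREADS)}")
```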

Overall Structure of DeepCuts

Then how does DeepCuts achieve such a difference?

DeepCuts takes the whole computational graph of a given workload as input and generates a corresponding set of GPU kernels.

The input graph describes the computation and data flow of a DNN model. An edge of the graph represents a tensor of data, and a node represents a tensor operation such as a convolution. When the graph is executed on the GPU, each node corresponds to a GPU kernel call or a call to a DNN library function such as cuDNN. The input graph is similar to the computational graph of PyTorch or TensorFlow.
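
As a rough sketch (the class and node names below are invented for illustration, not DeepCuts' actual data structures), such an input graph can be modeled as nodes for tensor operations and edges for the tensors flowing between them:

```python
# Illustrative model of a DNN computational graph (names invented for illustration).
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str            # e.g. "conv1"
    op: str              # tensor operation, e.g. "conv2d", "bias_add", "relu"
    inputs: list = field(default_factory=list)  # upstream nodes whose output tensors feed this op

# A tiny graph: conv2d -> bias_add -> relu (the pattern cuDNN can fuse).
conv = Node("conv1", "conv2d")
bias = Node("bias1", "bias_add", inputs=[conv])
relu = Node("relu1", "relu", inputs=[bias])

# Each node ultimately maps to a GPU kernel call or a library (e.g., cuDNN) call;
# an optimizer like DeepCuts decides how to group nodes into generated kernels.
for node in (conv, bias, relu):
    feeds = [n.name for n in node.inputs]
    print(f"{node.name}: op={node.op}, inputs={feeds}")
```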