Sue Hyun Park, September 13.

<aside> 🔗 We have reposted this blog on our Medium publication. Read this on Medium.

</aside>

As a deep learning (DL) task becomes more complex, a DL model needs more weights to reach high accuracy. During deep neural network (DNN) training, these weights are repeatedly adjusted to fit the given training data. One concern is that state-of-the-art DL models have millions of weights, while the number of training samples is typically much smaller. It is therefore critical to enlarge the training dataset so that the trained model generalizes, that is, properly handles unseen data.

Data augmentation is widely used to do the trick. It applies random transformations to existing training samples, producing additional, distinct samples. Take a look at the DNN training pipeline below. Until the target validation accuracy is met, two steps, data preparation and gradient computation, are repeated over multiple epochs. The data augmentation pipeline operates inside the data preparation step. Each training sample passes through two RandAugment layers, each of which randomly applies one of 14 distortions (e.g., shear, rotate, and solarize), followed by a random crop layer and a random flip layer, so the number of distinct augmented samples the model can see grows combinatorially.
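For a concrete picture, here is a minimal sketch of such a pipeline using torchvision's built-in transforms; the CIFAR-style crop size, magnitude, and normalization constants are illustrative choices, not the exact configuration from our experiments.

```python
import torchvision.transforms as T

# A CIFAR-style augmentation pipeline. RandAugment(num_ops=2) picks and applies
# two of the ~14 distortions per sample, playing the role of the two RandAugment
# layers described above; magnitude and normalization constants are illustrative.
augment = T.Compose([
    T.RandAugment(num_ops=2, magnitude=9),  # e.g. shear, rotate, solarize, ...
    T.RandomCrop(32, padding=4),            # random crop layer
    T.RandomHorizontalFlip(),               # random flip layer
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
# Every call re-samples the random parameters, so the same source image yields
# a different augmented sample in every epoch.
```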

Detailed DNN training pipeline that includes the data augmentation pipeline using two RandAugment layers.

<aside> ✅ An epoch is a complete pass over the entire training set. A step is the processing of one subset of samples, called a mini-batch, within an epoch.

</aside>
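The toy PyTorch loop below (with a made-up linear model and random tensors standing in for a real dataset) shows where epochs and steps sit in training.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data so the loop below actually runs; in practice the loader
# would wrap an augmented image dataset like the pipeline sketched above.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
                    batch_size=16, shuffle=True)

for epoch in range(3):                # one epoch: a full pass over the training set
    for x, y in loader:               # one step: a single mini-batch within the epoch
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)   # forward computation
        loss.backward()               # backward propagation
        optimizer.step()              # weight update
```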

Data augmentation introduces greater variation into the training set and helps train a DL model that generalizes better. However, the multi-layered computations in data augmentation heavily burden the CPU and can degrade training speed.

In this post, we outline our recent publication on reducing this data augmentation overhead. We propose data refurbishing, a novel sample reuse mechanism that accelerates DNN training while preserving model generalization. We also design and implement Revamper, a new data loading system that realizes data refurbishing.

Our research paper appeared at the 2021 USENIX Annual Technical Conference. Below, we explain how we analyzed the training speed bottleneck and arrived at the idea of data refurbishing, and then present Revamper's architecture and its performance advantages.

Overhead of Data Augmentation

The data preparation step is generally performed on the CPU. Gradient computation, on the other hand, requires computationally expensive forward computation and backward propagation, and thus runs on accelerators such as GPUs and TPUs. Thanks to recent specialized hardware accelerators like the NVIDIA A100 and Google TPU v3, gradient computation has gained a dramatic speedup. Meanwhile, the augmentation pipeline performs random transformations through multiple layers, which is computationally burdensome for the CPU. The resulting heavy CPU overhead has become the bottleneck of DNN training.
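As a rough sketch of this division of labor in a typical PyTorch setup (batch size, worker count, and the tiny linear model are illustrative stand-ins, not our actual configuration), the augmentation runs in DataLoader worker processes on the CPU while the model runs on the accelerator:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms as T

# CPU side: the augmentation pipeline runs inside `num_workers` worker
# processes spawned by the DataLoader.
augment = T.Compose([T.RandAugment(num_ops=2, magnitude=9),
                     T.RandomCrop(32, padding=4),
                     T.RandomHorizontalFlip(),
                     T.ToTensor()])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=augment)
loader = DataLoader(train_set, batch_size=128, num_workers=8, pin_memory=True)

# Accelerator side: the model lives on the GPU and runs forward/backward there.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)

images, labels = next(iter(loader))    # blocks until CPU workers finish a mini-batch
logits = model(images.to(device))      # gradient computation happens on the accelerator
# When augmentation is heavy, most wall-clock time is spent waiting on the
# loader, and the accelerator idles.
```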

To measure the impact, we analyze the training throughput of ResNet-50 trained on ImageNet while varying the number of RandAugment layers. Without any RandAugment layer, the training throughput reaches the maximum gradient computation speed of the GPU. As the number of RandAugment layers increases, training throughput drops noticeably.
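The figure reports an end-to-end ResNet-50/ImageNet measurement; as a much smaller illustration of the same trend, the sketch below times only the CPU-side augmentation of a synthetic 224x224 image while varying the number of RandAugment layers (absolute numbers are machine-dependent and not comparable to the figure).

```python
import time
import numpy as np
from PIL import Image
import torchvision.transforms as T

# Synthetic ImageNet-sized image; we only care about relative augmentation cost.
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

for num_layers in (0, 1, 2, 3):
    layers = [T.RandAugment(num_ops=1, magnitude=9) for _ in range(num_layers)]
    pipeline = T.Compose(layers + [T.RandomResizedCrop(224),
                                   T.RandomHorizontalFlip(),
                                   T.ToTensor()])
    start = time.time()
    for _ in range(500):
        pipeline(img)
    rate = 500 / (time.time() - start)
    print(f"{num_layers} RandAugment layer(s): ~{rate:.0f} images/sec on one CPU core")
```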

ResNet-50 training speed on ImageNet, varying the number of RandAugment layers. The horizontal line indicates the gradient computation speed on GPU.

Limitations of Existing Approaches and Our Questions

How can we reduce CPU overhead from data augmentation?

There have been efforts to reduce this computation overhead, but the stochastic nature of data augmentation undermines them. The first approach, taken by NVIDIA DALI and TrainBox, offloads augmentation to hardware accelerators with massive parallelism, such as GPUs and FPGAs. However, massive parallelism fits poorly with the stochastic characteristics of augmentation pipelines, because each sample follows its own randomly chosen sequence of transformations.
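To see why, consider the toy illustration below (plain Python, not DALI or TrainBox code): every sample draws its own sequence of operations and magnitudes, so samples in the same mini-batch rarely share a code path, which is awkward for batched accelerator kernels.

```python
import random

OPS = ["shear", "rotate", "solarize", "color", "posterize"]  # a subset of the 14 distortions

def sample_plan(num_layers=2):
    # Each sample independently draws which ops to apply and with what strength.
    return [(random.choice(OPS), round(random.uniform(0.0, 1.0), 2)) for _ in range(num_layers)]

mini_batch_plans = [sample_plan() for _ in range(4)]
for plan in mini_batch_plans:
    print(plan)   # e.g. [('rotate', 0.42), ('solarize', 0.91)]; rarely identical across samples
```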

Instead of changing the hardware, the second approach, data echoing from Google, cuts down the amount of computation by reusing training samples. For a better understanding, we first illustrate standard training with an augmentation pipeline. Here, the stochastic augmentation is applied independently in each epoch, so every augmented image is unique.

A high-level illustration of standard training. The stochastic augmentation is repeated for every epoch.
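The toy sketch below (with a hypothetical `augment` function standing in for the real pipeline) captures this standard behavior: augmentation is recomputed from scratch in every epoch, which is exactly the per-epoch cost that sample reuse tries to avoid.

```python
import random

def augment(image):
    # Stand-in for the stochastic augmentation pipeline (RandAugment, crop, flip).
    return (image, random.getrandbits(32))

dataset = ["img0", "img1", "img2"]

# Standard training: augmentation is re-applied in every epoch, so each epoch
# sees a fresh, unique set of augmented samples but also pays the full CPU cost.
for epoch in range(2):
    for image in dataset:
        augmented = augment(image)   # recomputed from scratch every epoch
        print(epoch, augmented)      # stand-in for gradient computation on `augmented`
# Sample reuse schemes such as data echoing instead feed each augmented sample
# to gradient computation several times to skip part of this work.
```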