Introduction

Data augmentation is now a critical component of the machine learning pipeline. It helps models achieve better performance by augmenting the original data samples with a fixed set of pre-defined functions. The intuition is that, by randomly applying augmentations to the original inputs, a machine learning model sees a more diverse set of samples during training, and will therefore generalize better (e.g. reach a higher test accuracy) when deployed.

Plain augmentations and regularizations

PyTorch has a nice summary of the possible augmentations:

Illustration of transforms - Torchvision main documentation
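For concreteness, here is a minimal sketch of such a plain pipeline with torchvision; the particular transforms and the CIFAR-10 normalization statistics are illustrative choices rather than part of any specific method.

```python
# A minimal sketch of a plain augmentation pipeline with torchvision.
# Every transform choice below is illustrative, not prescriptive.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # random spatial crop
    transforms.RandomHorizontalFlip(p=0.5),  # flip half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    # CIFAR-10 channel statistics, used here purely as an example.
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])
```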

One particularly interesting augmentation is the later-proposed Cutout:

Improved Regularization of Convolutional Neural Networks with Cutout

This method randomly masks out square regions in the inputs (and can equally be applied to activations), and serves nicely as a regularization/augmentation method; it is in fact very similar to DropBlock:

DropBlock: A regularization method for convolutional networks
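Below is a minimal sketch of the Cutout idea on a single image tensor; the patch size of 16 and the zero fill value are illustrative hyperparameters, and the helper itself is hypothetical. DropBlock applies essentially the same masking, but to intermediate feature maps rather than to the input.

```python
# A minimal sketch of Cutout, assuming a (C, H, W) image tensor.
# The patch is centred at a uniformly sampled pixel and clipped at the
# borders, as in the original paper; size=16 is an illustrative choice.
import torch

def cutout(img: torch.Tensor, size: int = 16) -> torch.Tensor:
    """Zero out one random size x size square region of img."""
    _, h, w = img.shape
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    img = img.clone()
    img[:, y1:y2, x1:x2] = 0.0
    return img
```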

CutMix later attracted a lot of attention, since it proposed to mix the inputs:

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
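A minimal sketch of the CutMix operation on a batch is given below. The Beta(alpha, alpha) sampling and the area-based label weight follow the paper, while the function name and signature are just illustrative.

```python
# A minimal sketch of CutMix for a batch x of shape (B, C, H, W) with
# integer labels y. Returns the mixed batch, both label sets, and the
# mixing weight lam (the surviving area fraction of the original image).
import torch

def cutmix(x: torch.Tensor, y: torch.Tensor, alpha: float = 1.0):
    b, _, h, w = x.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(b)
    # Box whose area is (1 - lam) of the image, centred at a random pixel.
    rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(0, cy - rh // 2), min(h, cy + rh // 2)
    x1, x2 = max(0, cx - rw // 2), min(w, cx + rw // 2)
    x = x.clone()
    x[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]
    # Adjust lam to the actual (border-clipped) patch area.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return x, y, y[perm], lam
```

The training loss is then mixed with the same weight, e.g. lam * criterion(logits, y_a) + (1 - lam) * criterion(logits, y_b).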

Complex augmentations

Researchers then looked at how to combine a series of augmentations into a full augmentation pipeline. AutoAugment, its faster policy-search variant Fast AutoAugment, and TrivialAugment all belong to this line of work; a TrivialAugment-style pipeline is sketched after the references below.

TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation

Fast AutoAugment
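Torchvision ships a built-in implementation of TrivialAugment, so a TrivialAugment-style pipeline can be as small as the sketch below (assuming torchvision >= 0.12; the surrounding transforms are illustrative).

```python
# A minimal sketch of a TrivialAugment pipeline via torchvision's built-in
# TrivialAugmentWide: one randomly chosen augmentation at one randomly
# chosen strength per image, with no policy search and nothing to tune.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
])
```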

Skill requirements

The candidate should be experienced in object-oriented programming in Python. Ideally, the candidate should have experience with, or at least be willing to learn, the common machine learning frameworks in Python (such as PyTorch and PyTorch Lightning).

Proposed research