1. Introduction

Semi-supervised learning (SSL)
Consistency training methods
- regularize model predictions to be invariant to small noise
- methods differ in how and where the noise is applied
  - typical choices: Gaussian noise, dropout noise, adversarial noise
Unsupervised Data Augmentation (UDA)
- examines the role of noise injection in consistency training
- advanced data augmentation methods
- replaces traditional noise injection methods with high-quality data augmentation methods that can improve the performance of consistency training

2. Unsupervised Data Augmentation (UDA)

$x$ : the input
$y^*$ : ground-truth prediction target
$p_\theta(y|x)$ : a model
- $\theta$ : the model parameters
$L$ and $U$ : the sets of labeled and unlabeled examples

2.1 Background: Supervised Data Augmentation

$q(\hat{x}|x)$ : the augmentation transformation ($\hat{x}$ : augmented examples)
- one can simply minimize the negative log-likelihood on the augmented examples
- the augmented set needs to provide additional inductive bias effectively → how the augmentation transformation is designed is crucial
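As a minimal sketch of minimizing the negative log-likelihood on augmented examples, the snippet below estimates $\mathbb{E}_{x,y^*\in L}\mathbb{E}_{\hat{x}\sim q(\hat{x}|x)}[-\log p_\theta(y^*|\hat{x})]$ by sampling. The toy linear classifier `predict`, the random-flip transformation `augment`, and the `n_samples` parameter are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Toy linear classifier standing in for p_theta(y|x) (an assumption)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def augment(x, rng):
    """Hypothetical q(x_hat|x): reverse the feature order with probability 0.5."""
    return list(reversed(x)) if rng.random() < 0.5 else list(x)

def augmented_nll(weights, labeled, rng, n_samples=4):
    """Monte Carlo estimate of the NLL of y* under augmented inputs x_hat."""
    total, count = 0.0, 0
    for x, y_star in labeled:
        for _ in range(n_samples):
            probs = predict(weights, augment(x, rng))
            total += -math.log(probs[y_star])
            count += 1
    return total / count

rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]  # one row per class
labeled = [([1.0, 0.0, 2.0], 0), ([0.0, 1.0, 1.0], 1)]
loss = augmented_nll(weights, labeled, rng)
```

The label $y^*$ is reused unchanged for every $\hat{x}$, which is why the transformation has to preserve the label for this objective to make sense.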
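Many studies have demonstrated the effectiveness of augmentation
- but because it has been applied only to the (usually small) set of labeled examples, it gives a steady but limited performance gain → plenty of room for further improvement! (cherry on the cake)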

2.2 Unsupervised Data Augmentation

Semi-supervised learning uses unlabeled examples to enforce smoothness of the model
- Given an input $x$, compute the output distribution $p_\theta(y|x)$ given $x$ and a noised version $p_\theta(y|x,\epsilon)$ by injecting a small noise $\epsilon$. The noise can be applied to $x$ or hidden states.
- Minimize a divergence metric between the two distributions $\mathcal{D}(p_\theta(y|x)||p_\theta(y|x,\epsilon))$.
This makes the model less sensitive to the noise $\epsilon$ and smoother with respect to changes in the input (or hidden) space
- minimizing the consistency loss gradually propagates label information from labeled examples to unlabeled ones
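The two steps above (predict on the clean input, predict on a noised input, then penalize the divergence) can be sketched as follows. This is only an illustration under stated assumptions: the noise is Gaussian and applied to the input $x$, the divergence is KL, and `predict` is a toy linear classifier; `sigma` and the sample values are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Toy linear classifier standing in for p_theta(y|x) (an assumption)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(weights, x, sigma, rng):
    """D(p_theta(y|x) || p_theta(y|x, eps)) with Gaussian input noise eps."""
    clean = predict(weights, x)
    noisy_x = [v + rng.gauss(0.0, sigma) for v in x]
    noisy = predict(weights, noisy_x)
    return kl_divergence(clean, noisy)

rng = random.Random(0)
weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
x_unlabeled = [1.0, 2.0, 3.0]  # no label is needed for this loss
loss = consistency_loss(weights, x_unlabeled, sigma=0.1, rng=rng)
```

No label appears anywhere in `consistency_loss`, which is what lets the term be computed on unlabeled examples.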
Data augmentations that are strong in supervised learning can also perform well when used to noise unlabeled examples in the semi-supervised consistency training framework
- because more diverse and natural advanced data augmentations can perform well in the supervised setting
Use a rich set of state-of-the-art data augmentations, validated in supervised settings, to inject the noise, and optimize the same consistency training objective on unlabeled examples

$$ \min_{\theta}\mathcal{J}(\theta)=\mathbb{E}_{x,y^*\in L}[-\log p_\theta (y^* | x)] + \lambda \mathbb{E}_{x\in U} \mathbb{E}_{\hat{x}\sim q(\hat{x}|x)}[\mathcal{D}_{\text{KL}}(p_{\tilde{\theta}}(y|x) || p_\theta(y|\hat{x}))] $$

joint training with labeled examples
- $\lambda$ (=1) : a weighting factor to balance the supervised cross entropy and the unsupervised consistency training loss
- $q(\hat{x}|x)$ : a data augmentation transformation
- $\tilde{\theta}$ : a fixed copy of the current parameters $\theta$ indicating that the gradient is not propagated through $\tilde{\theta}$
- different batch sizes for the supervised data and the unsupervised data
- simple augmentations including cropping and flipping for labeled examples
- to reduce the discrepancy between supervised training and prediction on unlabeled examples, the same simple augmentations are also applied to the unlabeled examples
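A rough sketch of how the two terms of $\mathcal{J}(\theta)$ combine for one labeled batch and one (larger) unlabeled batch is given below. It only evaluates the loss value: with no automatic differentiation involved, the fixed copy $\tilde{\theta}$ is represented by treating the clean prediction as a constant target. The toy classifier `predict`, the `flip` augmentation, and the batch contents are assumptions for illustration, not the paper's setup.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Toy linear classifier standing in for p_theta(y|x) (an assumption)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def uda_objective(weights, labeled, unlabeled, augment, lam=1.0):
    """Return (J, supervised term, consistency term) for one pair of batches."""
    supervised = sum(-math.log(predict(weights, x)[y]) for x, y in labeled)
    supervised /= len(labeled)
    consistency = 0.0
    for x in unlabeled:
        # p_{theta~}(y|x): a constant target. In an autodiff framework a
        # stop-gradient (e.g. jax.lax.stop_gradient or .detach()) would go here.
        target = predict(weights, x)
        prediction = predict(weights, augment(x))  # p_theta(y|x_hat)
        consistency += kl_divergence(target, prediction)
    consistency /= len(unlabeled)
    return supervised + lam * consistency, supervised, consistency

def flip(x):
    """Hypothetical stand-in for q(x_hat|x): reverse the feature order."""
    return list(reversed(x))

weights = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
labeled = [([1.0, 0.0, 2.0], 0), ([0.0, 1.0, 1.0], 1)]        # small batch
unlabeled = [[2.0, 1.0, 0.0], [0.5, 0.5, 3.0],
             [1.0, 2.0, 3.0], [3.0, 0.0, 1.0]]                # larger batch
total, sup, cons = uda_objective(weights, labeled, unlabeled, flip, lam=1.0)
```

Each term is averaged over its own batch, so the labeled and unlabeled batches can have different sizes without changing the balance set by $\lambda$.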