Entropy Minimization
Encourage the model to make confident (low-entropy) predictions on unlabeled data
$$ p_{\text{model}}(y|x; \theta) $$
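A minimal numpy sketch of an entropy penalty over model predictions; the function name `entropy_loss` and the example values are illustrative, not from the paper:

```python
import numpy as np

def entropy_loss(probs, eps=1e-8):
    """Mean entropy of p_model(y|x; theta) over a batch of unlabeled examples.

    probs: (batch, num_classes) softmax outputs.
    Minimizing this term rewards confident, low-entropy predictions,
    pushing decision boundaries away from dense unlabeled regions.
    """
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

# A confident prediction contributes less entropy than a flat one.
probs = np.array([[0.90, 0.05, 0.05],   # confident -> low entropy
                  [1/3, 1/3, 1/3]])     # flat      -> high entropy
print(entropy_loss(probs))  # ~0.75
```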
Consistency Regularization
Encourage the model to produce the same output distribution when its inputs are perturbed,
i.e., the output should be consistent across different augmentations of the same input:
$$ ||p_{\text{model}}(y | \text{Augment}(x);\theta)-p_{\text{model}}(y|\text{Augment}(x);\theta)||^2_2 $$
(Augment(x) is a stochastic transformation, so the two terms are evaluated on two different augmentations and are not identical.)
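A toy numpy sketch of this consistency loss, assuming a stand-in Gaussian-noise `augment()` and a throwaway linear model (both are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x):
    # Stand-in stochastic Augment(): additive Gaussian noise.
    return x + 0.1 * rng.normal(size=x.shape)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(logits_fn, x):
    """Squared L2 distance between predictions on two independent augmentations.

    Because augment() is stochastic, the two forward passes see different
    inputs, so the loss stays nonzero until the model is noise-invariant.
    """
    p1 = softmax(logits_fn(augment(x)))
    p2 = softmax(logits_fn(augment(x)))
    return np.mean(np.sum((p1 - p2) ** 2, axis=-1))

# Toy linear model, for illustration only.
W = rng.normal(size=(4, 3))
x = rng.normal(size=(8, 4))
print(consistency_loss(lambda v: v @ W, x))
```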
Mean Teacher
$$ J(\theta)=\mathbb{E}_{x,\eta^{\prime},\eta}[||f(x,\theta^{\prime},\eta^{\prime})-f(x,\theta,\eta)||^2] \newline \theta_t^{\prime}=\alpha \theta_{t-1}^{\prime}+(1-\alpha)\theta_t $$
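A small sketch of the EMA teacher update, assuming weights are stored as lists of numpy arrays:

```python
import numpy as np

def ema_update(teacher, student, alpha=0.999):
    """Mean Teacher update: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.

    teacher, student: lists of numpy weight arrays with matching shapes.
    The teacher (an exponential moving average of the student's weights)
    produces the consistency target f(x, theta', eta').
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]

# After each optimizer step on the student, refresh the teacher:
# teacher = ema_update(teacher, student)
```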
Virtual Adversarial Training
$$ \text{LDS}(x_*,\theta) := D[p(y|x_*,\hat{\theta}), p(y|x_*+r_{\text{vadv}},\theta)] \newline r_{\text{vadv}} := \text{argmax}_{r;||r||_2\le\epsilon} D[p(y|x_*,\hat{\theta}), p(y|x_*+r)] $$
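A rough, dependency-free sketch of the LDS term. Note the real method finds r_vadv with a gradient-based power-iteration step; lacking autodiff here, this stand-in samples random directions and keeps the worst one, which is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_div(p, q, eps=1e-8):
    # D[p || q] per example, for probability rows p and q.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def lds(predict, x, eps_norm=1.0, num_trials=32):
    """Crude Monte-Carlo stand-in for LDS(x_*, theta).

    predict: maps (batch, dim) inputs to (batch, classes) probabilities.
    We sample random perturbations r of norm eps_norm and keep the one
    maximizing D -- a simplification of the paper's power iteration.
    """
    p_clean = predict(x)                 # p(y | x_*, theta_hat), held fixed
    worst = np.zeros(x.shape[0])
    for _ in range(num_trials):
        r = rng.normal(size=x.shape)
        r *= eps_norm / (np.linalg.norm(r, axis=-1, keepdims=True) + 1e-12)
        worst = np.maximum(worst, kl_div(p_clean, predict(x + r)))
    return np.mean(worst)
```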
Generic Regularization
Encourage the model to behave linearly in-between training examples (MixUp)
$$ \tilde{x} = \lambda x_i + (1-\lambda) x_j, \text{where } x_i, x_j \text{ are raw input vectors} \newline \tilde{y} = \lambda y_i + (1-\lambda) y_j, \text{where } y_i, y_j \text{ are one-hot label encodings} $$
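A minimal numpy sketch of vanilla MixUp as defined above (the Beta parameter value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x_i, y_i, x_j, y_j, alpha=0.75):
    """Vanilla MixUp: one convex combination of two raw inputs
    and their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_i, x_j = rng.normal(size=4), rng.normal(size=4)
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(mixup(x_i, y_i, x_j, y_j))
```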
MixMatch uses all three of the above techniques! Its components are described below.
Data Augmentation
Apply a stochastic augmentation once to each labeled example and K times to each unlabeled example
$$ \hat{x}_b = \text{Augment}(x_b) \newline \hat{u}_{b,k} = \text{Augment}(u_b), \ k \in (1,\dots,K) $$
Label Guessing
Average the model's predictions across the K augmentations of each unlabeled example
$$ \bar{q}_b = \frac{1}{K}\sum^K_{k=1}p_{\text{model}}(y|\hat{u}_{b,k};\theta) $$
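A sketch of label guessing, reusing the stand-in stochastic `augment()` from the consistency sketch and a caller-supplied `predict` function (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(u):
    # Stand-in stochastic augmentation: additive Gaussian noise.
    return u + 0.1 * rng.normal(size=u.shape)

def guess_labels(predict, u_b, K=2):
    """q_bar_b: model predictions averaged over K augmentations of u_b.

    predict: maps (batch, dim) inputs to (batch, classes) probabilities.
    Averaging over augmentations gives a more stable guessed label.
    """
    return sum(predict(augment(u_b)) for _ in range(K)) / K
```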
Sharpening
Reduce the entropy of the guessed label distribution by lowering the temperature T (as T → 0 the output approaches one-hot)
$$ \text{Sharpen}(p,T)_i := p_i^{\frac{1}{T}} \Big/ \sum^L_{j=1}p_j^{\frac{1}{T}} $$
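A direct numpy transcription of the sharpening operator:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T).

    Lowering the temperature T lowers the entropy of p;
    as T -> 0 the output approaches a one-hot (argmax) distribution.
    """
    powered = p ** (1.0 / T)
    return powered / powered.sum(axis=-1, keepdims=True)

print(sharpen(np.array([0.4, 0.35, 0.25])))  # mass shifts to the largest entry
```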
MixUp
Mix each augmented example with another example from the combined labeled-and-unlabeled batch; taking the max of λ and 1−λ keeps the result closer to its first argument
$$ \lambda \sim \text{Beta}(\alpha, \alpha) \newline \lambda^\prime=\text{max}(\lambda, 1-\lambda) \newline x^\prime=\lambda^\prime x_1 + (1-\lambda^\prime)x_2 \newline p^\prime = \lambda^\prime p_1 + (1-\lambda^\prime)p_2 $$
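A sketch of this modified MixUp; `mixmatch_mixup` is an illustrative name, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixmatch_mixup(x1, p1, x2, p2, alpha=0.75):
    """MixMatch's modified MixUp.

    Taking lambda' = max(lambda, 1 - lambda) >= 0.5 guarantees the mixed
    example stays closer to (x1, p1), so a mixed labeled example remains
    'mostly labeled' and a mixed unlabeled example 'mostly unlabeled'.
    """
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * p1 + (1 - lam) * p2
```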
Loss Functions
$$ \mathcal{X}^\prime, \mathcal{U}^\prime = \text{MixMatch}(\mathcal{X}, \mathcal{U}, T, K, \alpha) \newline \mathcal{L}_\mathcal{X}=\frac{1}{|\mathcal{X}^\prime|} \sum_{x,p\in\mathcal{X}^\prime} \text{H}(p,p_{\text{model}}(y|x;\theta)) \newline \mathcal{L}_\mathcal{U} = \frac{1}{L|\mathcal{U}^\prime|} \sum_{u,q\in\mathcal{U}^\prime} ||q-p_{\text{model}}(y|u;\theta)||^2_2 \newline \mathcal{L}=\mathcal{L}_\mathcal{X}+\lambda_\mathcal{U}\mathcal{L}_\mathcal{U} $$
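A numpy sketch of the combined loss, assuming the MixMatch targets and model predictions are already computed as probability matrices (the default λ_U = 75 follows the paper's suggested starting point):

```python
import numpy as np

def mixmatch_loss(p_lab, pred_lab, q_unlab, pred_unlab,
                  lambda_u=75.0, eps=1e-8):
    """Combined loss L = L_X + lambda_U * L_U.

    p_lab / pred_lab:     (B, L) targets and model probabilities on X'
    q_unlab / pred_unlab: (B_u, L) guessed targets and probabilities on U'
    L_X is cross-entropy; L_U is a squared L2 (Brier-style) loss, which is
    bounded and less sensitive to incorrect guessed labels.
    """
    L = p_lab.shape[1]
    l_x = -np.mean(np.sum(p_lab * np.log(pred_lab + eps), axis=1))
    l_u = np.mean(np.sum((q_unlab - pred_unlab) ** 2, axis=1)) / L
    return l_x + lambda_u * l_u
```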
Hyper-parameters
T = 0.5 and K = 2 are fixed in all of the paper's experiments; α = 0.75 and λ_U = 75 are suggested starting points for tuning, with λ_U linearly ramped up to its maximum over early training