Regularization
Dark knowledge : the knowledge contained in a model's predictions on wrong (non-target) answers
→ Its importance was first demonstrated in knowledge distillation
Class-wise Self-Knowledge Distillation (CS-KD)
Softmax classifier for fully-supervised classification tasks
$$ P(y|\mathbf{x};\theta,T)=\frac{\exp(f_y(\mathbf{x};\theta)/T)}{\sum^C_{i=1}\exp(f_i(\mathbf{x};\theta)/T)} $$
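A minimal sketch of this temperature-scaled softmax, assuming PyTorch; `logits` is a placeholder name for the output vector $f(\mathbf{x};\theta)$:

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """P(y|x; theta, T): softmax over the logits f_i(x; theta) divided by T.

    A larger T gives a softer (higher-entropy) distribution; T = 1 recovers
    the standard softmax.
    """
    return F.softmax(logits / T, dim=-1)
```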
Encourages a consistent predictive distribution across samples of the same class
Class-wise regularization loss
$$ \mathcal{L}_{\text{cls}}(\mathbf{x},\mathbf{x}';\theta,T):=\text{KL}(P(y|\mathbf{x}';\tilde{\theta},T)||P(y|\mathbf{x};\theta,T)) $$
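A minimal sketch of this loss, assuming PyTorch; `logits_x` and `logits_x_prime` are the model outputs for two different samples of the same class, detaching the $\mathbf{x}'$ branch plays the role of the fixed copy $\tilde{\theta}$, and the default temperature is only illustrative:

```python
import torch
import torch.nn.functional as F

def class_wise_reg_loss(logits_x: torch.Tensor,
                        logits_x_prime: torch.Tensor,
                        T: float = 4.0) -> torch.Tensor:
    """L_cls(x, x'; theta, T) = KL( P(y|x'; ~theta, T) || P(y|x; theta, T) )."""
    # P(y|x; theta, T) in log space (gradients flow through this branch).
    log_p_x = F.log_softmax(logits_x / T, dim=-1)
    # P(y|x'; ~theta, T): detached, so it acts as a fixed target distribution.
    p_x_prime = F.softmax(logits_x_prime.detach() / T, dim=-1)
    # F.kl_div(input, target) computes KL(target || model) when `input` holds
    # the model's log-probabilities, matching the formula above.
    return F.kl_div(log_p_x, p_x_prime, reduction="batchmean")
```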
Total training loss $\mathcal{L}_{\text{CS-KD}}$
$$ \mathcal{L}_{\text{CS-KD}}(\mathbf{x},\mathbf{x}',y;\theta,T):=\mathcal{L}_{\text{CE}}(\mathbf{x},y;\theta)+\lambda_{\text{cls}}\cdot T^2\cdot \mathcal{L}_{\text{cls}}(\mathbf{x},\mathbf{x}';\theta,T) $$
$\mathcal{L}_{\text{CE}}$ : the standard cross-entropy loss
$\lambda_{\text{cls}}>0$ : a loss weight for the class-wise regularization
The square of the temperature $T^2$ is applied as in the original KD (see the derivation and the sketch below)
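Putting the pieces together, a self-contained sketch of the total objective, assuming PyTorch; `model`, `lambda_cls`, and the default temperature are placeholder choices for illustration:

```python
import torch
import torch.nn.functional as F

def cs_kd_loss(model, x, x_prime, y, T: float = 4.0, lambda_cls: float = 1.0):
    """L_CS-KD = L_CE(x, y; theta) + lambda_cls * T^2 * L_cls(x, x'; theta, T).

    x and x_prime are batches of *different* samples sharing the labels y.
    """
    logits_x = model(x)
    with torch.no_grad():              # ~theta: no gradient through the x' branch
        logits_x_prime = model(x_prime)

    ce = F.cross_entropy(logits_x, y)                          # L_CE
    log_p_x = F.log_softmax(logits_x / T, dim=-1)
    p_x_prime = F.softmax(logits_x_prime / T, dim=-1)
    kl = F.kl_div(log_p_x, p_x_prime, reduction="batchmean")   # L_cls
    # T^2 compensates for the 1/T^2 gradient scale of the soft-target term
    # (see the derivation below), keeping it comparable to the CE term.
    return ce + lambda_cls * (T ** 2) * kl
```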
$$ \begin{aligned} q_i&=\frac{\exp(z_i/T)}{\sum_j\exp(z_j/T)} &&(1)\\ \frac{\partial C}{\partial z_i}&=\frac{1}{T}(q_i-p_i)=\frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}}-\frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) &&(2)\\ \frac{\partial C}{\partial z_i}&\approx \frac{1}{T}\left(\frac{1+z_i/T}{N+\sum_j z_j/T}-\frac{1+v_i/T}{N+\sum_j v_j/T}\right) &&(3)\\ \frac{\partial C}{\partial z_i}&\approx \frac{1}{NT^2}(z_i-v_i) &&(4) \end{aligned} $$
(3) uses $e^x\approx 1+x$ for high temperature $T$; (4) additionally assumes zero-mean logits, $\sum_j z_j=\sum_j v_j=0$. Since the gradient of the soft-target loss scales as $1/T^2$, multiplying $\mathcal{L}_{\text{cls}}$ by $T^2$ keeps its contribution comparable to that of $\mathcal{L}_{\text{CE}}$.
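A quick numerical sanity check of (2) against the approximation (4), assuming NumPy; the logit values are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 20.0                        # number of classes, a large temperature
z = rng.normal(size=N); z -= z.mean()  # student logits, zero-mean
v = rng.normal(size=N); v -= v.mean()  # teacher logits, zero-mean

q = np.exp(z / T) / np.exp(z / T).sum()   # eq. (1) for the student
p = np.exp(v / T) / np.exp(v / T).sum()   # same softening for the teacher

exact = (q - p) / T               # eq. (2): exact gradient dC/dz_i
approx = (z - v) / (N * T ** 2)   # eq. (4): high-temperature approximation
print(np.abs(exact - approx).max())  # small when T is large and logits are zero-mean
```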
Preventing overconfident predictions
Reducing intra-class variations
Examining the prediction values of the softmax
Log-probabilities of the softmax scores