Regularization
Dark knowledge : the knowledge contained in a model's predictions on wrong (non-target) answers
→ Its importance was first demonstrated in knowledge distillation
Class-wise Self-Knowledge Distillation (CS-KD)
Softmax classifier for fully-supervised classification tasks
$$ P(y|\mathbf{x};\theta,T)=\frac{\exp(f_y(\mathbf{x};\theta)/T)}{\sum^C_{i=1}\exp(f_i(\mathbf{x};\theta)/T)} $$
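A minimal sketch of this temperature-scaled softmax, assuming PyTorch; `logits` is a placeholder name for the output vector $f(\mathbf{x};\theta)$:

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """P(y|x; theta, T): softmax over the logits f_i(x; theta) divided by T.

    A larger T gives a softer (higher-entropy) distribution; T = 1 recovers
    the standard softmax.
    """
    return F.softmax(logits / T, dim=-1)
```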
Encourages a consistent predictive distribution across samples of the same class
Class-wise regularization loss
$$ \mathcal{L}_{\text{cls}}(\mathbf{x},\mathbf{x}';\theta,T):=\text{KL}(P(y|\mathbf{x}';\tilde{\theta},T)||P(y|\mathbf{x};\theta,T)) $$
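A minimal sketch of this loss, assuming PyTorch; `logits_x` and `logits_x_prime` are the model outputs for two different samples of the same class, detaching the $\mathbf{x}'$ branch plays the role of the fixed copy $\tilde{\theta}$, and the default temperature is only illustrative:

```python
import torch
import torch.nn.functional as F

def class_wise_reg_loss(logits_x: torch.Tensor,
                        logits_x_prime: torch.Tensor,
                        T: float = 4.0) -> torch.Tensor:
    """L_cls(x, x'; theta, T) = KL( P(y|x'; ~theta, T) || P(y|x; theta, T) )."""
    # P(y|x; theta, T) in log space (gradients flow through this branch).
    log_p_x = F.log_softmax(logits_x / T, dim=-1)
    # P(y|x'; ~theta, T): detached, so it acts as a fixed target distribution.
    p_x_prime = F.softmax(logits_x_prime.detach() / T, dim=-1)
    # F.kl_div(input, target) computes KL(target || model) when `input` holds
    # the model's log-probabilities, matching the formula above.
    return F.kl_div(log_p_x, p_x_prime, reduction="batchmean")
```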
Total training loss $\mathcal{L}_{\text{CS-KD}}$
$$ \mathcal{L}_{\text{CS-KD}}(\mathbf{x},\mathbf{x}',y;\theta,T):=\mathcal{L}_{\text{CE}}(\mathbf{x},y;\theta)+\lambda_{\text{cls}}\cdot T^2\cdot \mathcal{L}_{\text{cls}}(\mathbf{x},\mathbf{x}';\theta,T) $$
$\mathcal{L}_{\text{CE}}$ : the standard cross-entropy loss
$\lambda_{\text{cls}}>0$ : a loss weight for the class-wise regularization
The square of the temperature $T^2$ is applied as in the original KD (see the derivation and the sketch below)
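Putting the pieces together, a self-contained sketch of the total objective, assuming PyTorch; `model`, `lambda_cls`, and the default temperature are placeholder choices for illustration:

```python
import torch
import torch.nn.functional as F

def cs_kd_loss(model, x, x_prime, y, T: float = 4.0, lambda_cls: float = 1.0):
    """L_CS-KD = L_CE(x, y; theta) + lambda_cls * T^2 * L_cls(x, x'; theta, T).

    x and x_prime are batches of *different* samples sharing the labels y.
    """
    logits_x = model(x)
    with torch.no_grad():              # ~theta: no gradient through the x' branch
        logits_x_prime = model(x_prime)

    ce = F.cross_entropy(logits_x, y)                          # L_CE
    log_p_x = F.log_softmax(logits_x / T, dim=-1)
    p_x_prime = F.softmax(logits_x_prime / T, dim=-1)
    kl = F.kl_div(log_p_x, p_x_prime, reduction="batchmean")   # L_cls
    # T^2 compensates for the 1/T^2 gradient scale of the soft-target term
    # (see the derivation below), keeping it comparable to the CE term.
    return ce + lambda_cls * (T ** 2) * kl
```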
$$ \begin{aligned} q_i&=\frac{\exp(z_i/T)}{\sum_j\exp(z_j/T)} &&(1)\\ \frac{\partial C}{\partial z_i}&=\frac{1}{T}(q_i-p_i)=\frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}}-\frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right) &&(2)\\ \frac{\partial C}{\partial z_i}&\approx \frac{1}{T}\left(\frac{1+z_i/T}{N+\sum_j z_j/T}-\frac{1+v_i/T}{N+\sum_j v_j/T}\right) &&(3)\\ \frac{\partial C}{\partial z_i}&\approx \frac{1}{NT^2}(z_i-v_i) &&(4) \end{aligned} $$
(3) uses $e^x\approx 1+x$ for high temperature $T$; (4) additionally assumes zero-mean logits, $\sum_j z_j=\sum_j v_j=0$. Since the gradient of the soft-target loss scales as $1/T^2$, multiplying $\mathcal{L}_{\text{cls}}$ by $T^2$ keeps its contribution comparable to that of $\mathcal{L}_{\text{CE}}$.
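A quick numerical sanity check of (2) against the approximation (4), assuming NumPy; the logit values are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 10, 20.0                        # number of classes, a large temperature
z = rng.normal(size=N); z -= z.mean()  # student logits, zero-mean
v = rng.normal(size=N); v -= v.mean()  # teacher logits, zero-mean

q = np.exp(z / T) / np.exp(z / T).sum()   # eq. (1) for the student
p = np.exp(v / T) / np.exp(v / T).sum()   # same softening for the teacher

exact = (q - p) / T               # eq. (2): exact gradient dC/dz_i
approx = (z - v) / (N * T ** 2)   # eq. (4): high-temperature approximation
print(np.abs(exact - approx).max())  # small when T is large and logits are zero-mean
```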
Preventing overconfident predictions
Reducing intra-class variations
Examining the prediction values of the softmax
Log-probabilities of the softmax scores