1. Introduction

Neural Architecture Search (NAS)
- Automates neural architecture design using reinforcement learning, differentiable search, evolutionary search, and other algorithms
Knowledge Distillation (KD)
- Trains a (usually small) student neural network using the supervision of a (relatively large) teacher network
- Previous KD work transfers the teacher's knowledge to a predefined student
- The optimal student architecture may differ from one teacher model to another
Architecture-aware Knowledge Distillation (AKD)
- Searches for the best student architecture for distilling a given teacher model
- Reinforcement Learning (RL) based NAS process with a KD-based reward function
- Achieves SOTA on the ImageNet classification task and outperforms conventional NAS
- The optimal architecture AKD finds on the ImageNet classification task also performs well on other tasks, such as million-level face recognition
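
These notes do not give the form of the KD-based reward. As a rough sketch only, the code below assumes an MnasNet-style reward (accuracy scaled by a soft latency penalty), with the accuracy taken from a student trained under distillation. The function name `kd_reward`, the target latency, and the exponent `w` are illustrative assumptions, not values from the notes.

```python
def kd_reward(kd_accuracy, latency_ms, target_latency_ms=15.0, w=-0.07):
    """Hypothetical KD-guided reward for an RL-based NAS controller.

    kd_accuracy: validation accuracy of a sampled student after training
                 with the teacher's supervision (assumed input).
    latency_ms:  measured latency of the sampled student (assumed input).
    The penalty term (latency / target) ** w follows the MnasNet-style
    soft constraint; w < 0 lowers the reward for slower models.
    """
    return kd_accuracy * (latency_ms / target_latency_ms) ** w


# A student exactly at the target latency keeps its accuracy as reward;
# a slower student with the same accuracy is rewarded less.
on_target = kd_reward(0.70, 15.0)
too_slow = kd_reward(0.70, 30.0)
```

The only difference from a plain NAS reward in this sketch is where `kd_accuracy` comes from: the student is scored after distillation from the given teacher rather than after training on ground-truth labels alone.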

2. Knowledge distillation

Input space : I, output space : O
An ideal model is a connotative mapping function

$$ f : x \mapsto y, x \in \mathcal{I}, y \in \mathcal{O} $$
- The model’s conditional probability function : p(y|x)
The knowledge of a neural network

$$ \hat{f} : x \mapsto \hat{y}, x \in \mathcal{I}, \hat{y} \in \mathcal{O} $$
- The network’s conditional probability function : $\hat{p}(\hat{y}|x)$
The difference between the two conditional probability functions is the dark part of the neural network’s knowledge
- e.g. Margin between classes
  - One-hot output y constrains the angular distances between classes to the same 90 degrees
  - Similar classes/samples should have smaller angular distances
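
The margin example can be checked numerically. The sketch below measures the angle between class target vectors: one-hot targets are always orthogonal, while softened targets place similar classes closer together. The three classes and the soft-target values are made up for illustration.

```python
import math


def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cos))


# One-hot targets over (cat, dog, car): every pair is orthogonal.
cat_onehot, dog_onehot, car_onehot = [1, 0, 0], [0, 1, 0], [0, 0, 1]
onehot_cat_dog = angle_deg(cat_onehot, dog_onehot)
onehot_cat_car = angle_deg(cat_onehot, car_onehot)

# Hypothetical soft targets: cat and dog share probability mass.
cat_soft = [0.70, 0.25, 0.05]
dog_soft = [0.25, 0.70, 0.05]
car_soft = [0.05, 0.05, 0.90]
soft_cat_dog = angle_deg(cat_soft, dog_soft)
soft_cat_car = angle_deg(cat_soft, car_soft)
```

With these values, `soft_cat_dog` is smaller than `soft_cat_car`, and both are below the 90 degrees that one-hot targets impose on every pair.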

Dark knowledge distillation
- Trains the student model to match the teacher model's full softmax distribution
- [8] : the distribution of logits over the wrong responses carries information about the similarity between output categories
- [3] : the soft-target distribution acts as an importance weight based on the teacher’s confidence
- [42] : the posterior entropy viewpoint, claiming that soft targets bring robustness by regularizing a much more informed choice of alternatives than blind entropy regularization
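
Matching the teacher's full softmax distribution can be sketched as a cross-entropy between the teacher's and the student's output distributions. The `temperature` parameter is an assumption borrowed from common KD practice and is not mentioned in these notes; the function names and logits are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher_logits = [4.0, 2.5, -1.0]   # hypothetical teacher output
good_student = [3.8, 2.6, -0.9]     # close to the teacher
poor_student = [4.0, -1.0, 2.5]     # same top-1, wrong runner-up

loss_good = soft_target_loss(good_student, teacher_logits)
loss_poor = soft_target_loss(poor_student, teacher_logits)
```

Both students predict the same top-1 class here, so a one-hot label would score them almost identically; the soft-target loss separates them because it also penalizes the mismatch on the wrong responses.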

Are all student networks equally capable of receiving knowledge from different teachers?
- 8 teacher models
- 5 different student architectures from the search space defined by MNAS
- The students all performed differently, and no student achieved the best result for every teacher network
Distribution
- T(A) & T(B) have the lowest KL divergence → their distributions are the closest
- S2 is the best student for T(A) but the worst-performing student for T(B)
- Disentangling the distribution is needed to obtain good specificity
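
The closeness of two teachers' distributions can be measured as the KL divergence between their softmax outputs, averaged over samples. The sketch below shows the computation only; the teacher outputs are invented toy values, not the measurements behind the comparison above.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(outputs_p, outputs_q):
    """Average KL divergence over paired per-sample distributions."""
    pairs = list(zip(outputs_p, outputs_q))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy softmax outputs on two samples (hypothetical values).
teacher_a = [[0.80, 0.15, 0.05], [0.10, 0.85, 0.05]]
teacher_b = [[0.75, 0.20, 0.05], [0.15, 0.80, 0.05]]
teacher_f = [[0.50, 0.30, 0.20], [0.30, 0.50, 0.20]]

kl_a_b = mean_kl(teacher_a, teacher_b)
kl_a_f = mean_kl(teacher_a, teacher_f)
```

KL divergence is asymmetric, so `mean_kl(teacher_a, teacher_b)` and `mean_kl(teacher_b, teacher_a)` generally differ; which direction the comparison used is not stated in these notes.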
Accuracy
- T(A) was the most accurate model, but its students did not perform well
- [25] : this is because the teacher’s complexity hinders the learning process
  - The student lacks the capacity to mimic the teacher, yet it is worth noting that S2 outperformed S1
- [6] : a high-performing teacher’s output does not differ much from the ground truth, so KD would be of little use
- The lower-accuracy T(F) has a lower KL divergence from the GT than the higher-accuracy T(A)
Using a predefined student merely spends the student’s parameters on learning the teacher’s architecture and is not an optimal solution