1. Introduction

Knowledge transfer
- Distillation
  - Same dataset, different architecture
  - neural network architecture search, model compression, creating diverse ensembles
- Transfer Learning
  - Same architecture, different dataset
  - Learning from limited data
Matching the Jacobian of the network's output
- The idea of matching Jacobians is not new (Sobolev training, Jacobians as attention maps)
- However, no loss function designed for matching Jacobians has been used
In this study,
- Matching Jacobians by adding noise to the input → a special case of classical distillation
- Adds Jacobian matching on top of a recent transfer learning method
- Jacobian matching also works between two arbitrary networks
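
The first claim can be made concrete with a short derivation. This is a sketch under assumptions that are not stated in these notes: a scalar-output teacher $\mathcal{T}$ and student $\mathcal{S}$ (symbols chosen here for illustration), zero-mean input noise $\boldsymbol{\xi}$ with covariance $\sigma^2 I$, a squared-error matching loss, and $\sigma$ small enough that a first-order expansion of both networks is accurate.

$$ \mathbb{E}_{\boldsymbol{\xi}}\left[\left(\mathcal{T}(\mathbf{x}+\boldsymbol{\xi})-\mathcal{S}(\mathbf{x}+\boldsymbol{\xi})\right)^2\right] \approx \left(\mathcal{T}(\mathbf{x})-\mathcal{S}(\mathbf{x})\right)^2 + \sigma^2\left\|\nabla_x\mathcal{T}(\mathbf{x})-\nabla_x\mathcal{S}(\mathbf{x})\right\|_2^2 $$

The cross term drops out because $\mathbb{E}[\boldsymbol{\xi}]=0$, and $\mathbb{E}[(\mathbf{v}^T\boldsymbol{\xi})^2]=\sigma^2\|\mathbf{v}\|_2^2$ for any fixed $\mathbf{v}$. Under these assumptions, distilling on noisy inputs amounts to ordinary output matching plus a penalty on the difference between the two gradients.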

2. Related Work

Sobolev training

https://arxiv.org/pdf/1706.04859.pdf
- Also uses higher-order derivatives when training with limited data → this paper makes explicit how that method relates to regular distillation based on activation matching
- Data Jacobian Matrix, Jacobian as attention map
Jacobian-norm regularizer (1992)
- Penalizes the Jacobian norm
Knowledge Distillation
- Softmax with temperature (Hinton et al., 2015)
- Squared error between logits (Ba & Caruana, 2014)
- Matching intermediate features along with the outputs (Romero et al., 2014, Zagoruyko & Komodakis, 2017)
- Adding noise to logits (Sau & Balasubramanian, 2016)
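
The first two distillation objectives in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the cited authors' code: the function names, the example logits, and the default `temperature=4.0` are all assumptions made here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis, with logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(teacher_logits, student_logits, temperature=4.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def logit_mse_loss(teacher_logits, student_logits):
    """Mean squared error between raw (pre-softmax) logits."""
    diff = np.asarray(teacher_logits, dtype=float) - np.asarray(student_logits, dtype=float)
    return float((diff ** 2).mean())

# Made-up logits for one example with three classes.
teacher = np.array([[5.0, 1.0, -2.0]])
student = np.array([[3.0, 0.5, -1.0]])
print(soft_target_loss(teacher, student))
print(logit_mse_loss(teacher, student))
```

A higher temperature flattens the teacher distribution, so the student receives more signal about the relative ordering of the non-target classes.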

First-order Taylor series expansion

$$ f(\mathbf{x}+\Delta\mathbf{x})=f(\mathbf{x})+\nabla_xf(\mathbf{x})^T(\Delta\mathbf{x})+\mathcal{O}(\epsilon^2) \tag{1} $$
- $f : \mathbb{R}^D \rightarrow \mathbb{R}$ is the function being expanded
- Holds within the neighborhood $\{\mathbf{x}+\Delta\mathbf{x}:||\Delta\mathbf{x}||\leq\epsilon\}$
Non-linearities present in neural nets
- non-linear activation (ReLU, sigmoid, ...)
- pooling operators

For ReLU, $\frac{d\sigma(z)}{dz}=0\text{ or }1$ everywhere except at $z=0$ (max-pooling behaves similarly)
For a piecewise linear net, there exists an $\epsilon>0$ for which the super-linear terms in Equation 1 are exactly zero
- $f(\mathbf{x}+\Delta\mathbf{x})=f(\mathbf{x})+\nabla_xf(\mathbf{x})^T(\Delta\mathbf{x})$ ($\mathbf{x}$ and $\mathbf{x}+\Delta\mathbf{x}$ lie on the same linear piece)
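
The exactness of the first-order expansion for a piecewise linear net can be checked numerically. The sketch below assumes a small one-hidden-layer ReLU network with random weights (sizes and names are made up for illustration) and picks a step that is short enough that no ReLU can change its on/off state.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8                        # input and hidden sizes (arbitrary)
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
w2 = rng.normal(size=H)

def f(x):
    """Scalar-output one-hidden-layer ReLU network."""
    return float(w2 @ np.maximum(W1 @ x + b1, 0.0))

def grad_f(x):
    """Gradient of f at x, valid wherever no pre-activation is exactly zero."""
    active = (W1 @ x + b1 > 0).astype(float)
    return W1.T @ (w2 * active)

x = rng.normal(size=D)
pre = W1 @ x + b1
# Any step shorter than eps cannot flip a ReLU: unit i's pre-activation
# moves by at most ||W1[i]|| * ||dx||, which stays below |pre[i]|.
eps = np.min(np.abs(pre) / np.linalg.norm(W1, axis=1))

direction = rng.normal(size=D)
dx = 0.5 * eps * direction / np.linalg.norm(direction)

exact = f(x + dx)
first_order = f(x) + grad_f(x) @ dx
print(abs(exact - first_order))    # zero up to floating-point round-off
```

Inside this ball the set of active units is fixed, so the network is an affine function of the input and the gradient fully describes its local behavior.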

The Jacobian's dimensions are independent of the network architecture
- With $k$ output classes and input dimension $D$, the network's Jacobian has dimension $D \times k$
  
  → Jacobians of different architectures can be compared!
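
A small numerical sketch of this property: two networks with very different depth and width still produce Jacobians of the same $D \times k$ shape, so they can be subtracted directly. Everything here is an assumption for illustration (random tanh MLPs, a central-difference Jacobian, and squared Frobenius distance as the comparison), not the loss used in the paper.

```python
import numpy as np

def make_mlp(layer_sizes, seed):
    """Random tanh MLP; returns a function mapping R^D -> R^k."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    def forward(x):
        h = x
        for W in weights[:-1]:
            h = np.tanh(W @ h)
        return weights[-1] @ h      # linear output layer
    return forward

def jacobian(f, x, h=1e-5):
    """Central-difference Jacobian, laid out as D x k to match the notes."""
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        rows.append((f(x + step) - f(x - step)) / (2 * h))
    return np.stack(rows)           # row i holds d(outputs) / d(x_i)

D, k = 6, 3
small_net = make_mlp([D, 4, k], seed=0)             # one narrow hidden layer
large_net = make_mlp([D, 64, 64, 32, k], seed=1)    # three wider hidden layers

x = np.random.default_rng(2).normal(size=D)
J_small, J_large = jacobian(small_net, x), jacobian(large_net, x)
print(J_small.shape, J_large.shape)                 # both are (D, k)
print(np.sum((J_small - J_large) ** 2))             # squared Frobenius distance
```

In practice an automatic-differentiation library would replace the finite-difference loop; it is used here only to keep the example dependency-light.
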
Different weight configurations can yield the same Jacobian
- Example: permutation symmetry among the neurons of an intermediate hidden layer
  - Permuting the neurons leaves the Jacobian unchanged (the underlying function is the same)
- In general, this arises from the redundancy of neural network models and the non-convexity of the loss surface
Both properties should be exploited for knowledge transfer → but how?
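
The permutation case can be verified directly. This sketch assumes a one-hidden-layer tanh network with random weights (all names and sizes are illustrative) and reverses the order of the hidden units, which changes the weight matrices but not the function or its Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, k = 5, 7, 3
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2 = rng.normal(size=(k, H))

def net(x, W1, b1, W2):
    """One-hidden-layer tanh network, R^D -> R^k."""
    return W2 @ np.tanh(W1 @ x + b1)

def jacobian(x, W1, b1, W2):
    """Analytic D x k Jacobian of net with respect to x."""
    s = 1.0 - np.tanh(W1 @ x + b1) ** 2    # tanh'(pre-activation), shape (H,)
    return ((W2 * s) @ W1).T               # (k, D) transposed to (D, k)

perm = np.arange(H)[::-1]                  # reverse the hidden units
W1_p, b1_p, W2_p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=D)
J, J_p = jacobian(x, W1, b1, W2), jacobian(x, W1_p, b1_p, W2_p)
print(np.array_equal(W1, W1_p))            # False: the weights differ
print(np.allclose(J, J_p))                 # True: the Jacobians agree
```

Matching Jacobians therefore constrains the function the student computes, not the particular weights it uses to compute it.
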