Convolutional neural networks
Vision Transformer (ViT)
Data-efficient image Transformers (DeiT)
Contributions
The attention mechanism matches a query vector against a set of (key, value) vector pairs
$$ \text{Attention}(Q, K, V) = \text{Softmax}(QK^\text{T}/\sqrt{d})V, \tag{1} $$
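A minimal NumPy sketch of Eq. (1), scaled dot-product attention; the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d) queries, K: (m, d) keys, V: (m, dv) values.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (n, m) scaled dot products
    return softmax(scores) @ V         # each output row is a convex combination of V's rows
```

Since the softmax weights in each row sum to 1, every output row is a weighted average of the value vectors.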
Self-attention layer
Multi-head self-attention layer (MSA)
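MSA runs several attention "heads" in parallel on different learned projections of the same input, then concatenates the results and applies an output projection. A self-contained sketch, assuming the common design where the model dimension d is split evenly across h heads (the weight names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    # X: (n, d) token embeddings; Wq, Wk, Wv, Wo: (d, d) projection matrices; h: number of heads.
    n, d = X.shape
    dh = d // h                              # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # (n, d) each
    outs = []
    for i in range(h):
        s = slice(i * dh, (i + 1) * dh)      # columns belonging to head i
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        outs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outs, axis=-1) @ Wo  # (n, d) after the output projection
```

In self-attention the queries, keys, and values all come from the same sequence X, which is what distinguishes this layer from general (cross-)attention.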