1. Introduction
2. Related Work
3. Method
3.1 Vision Transformer (ViT)

- The standard Transformer takes a 1D sequence of token embeddings as input
- The 2D image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is therefore reshaped into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
- $(H, W)$ : the resolution of the original image
- $C$ : the number of channels
- $(P, P)$ : the resolution of each image patch
- $N = HW/P^2$ : the resulting number of patches
- The Transformer uses a constant latent vector size $D$ through all of its layers
- Each patch is flattened and mapped to $D$ dimensions with a trainable linear projection (Eq. 1)
→ the output of this projection is the patch embedding
- Similar to BERT's $\text{[class]}$ token, a learnable embedding is prepended to the sequence of embedded patches ($\mathbf{z}^0_0 = \mathbf{x}_\text{class}$)
- Its state at the output of the Transformer encoder ($\mathbf{z}^0_L$) serves as the image representation $\mathbf{y}$ (Eq. 4)
- During both pre-training and fine-tuning, a classification head is attached to $\mathbf{z}^0_L$
- The classification head is an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning
- Position embeddings are added to the patch embeddings to retain positional information
- Standard learnable 1D position embeddings are used (2D-aware embeddings showed no significant gains, Appendix D.3)
- The Transformer encoder consists of alternating MSA (Multi-head Self-Attention) and MLP blocks
- MSA (Multi-head Self-Attention) (Appendix A)
- For each element, compute a weighted sum over all values $\mathbf{v}$ in the sequence
- The attention weights $A_{ij}$ are based on the pairwise similarity between $\mathbf{q}^i$ and $\mathbf{k}^j$ (SA: Self-Attention)
$$
\begin{aligned}
[\mathbf{q}, \mathbf{k}, \mathbf{v}] &= \mathbf{z}\mathbf{U}_{qkv} &\mathbf{U}_{qkv} &\in \mathbb{R}^{D \times 3D_h}, &(5) \\
A &= \text{softmax}(\mathbf{q}\mathbf{k}^\text{T}/\sqrt{D_h}) &A &\in \mathbb{R}^{N \times N}, &(6) \\
\text{SA}(\mathbf{z}) &= A\mathbf{v}. &&&(7)
\end{aligned}
$$
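Below is a minimal PyTorch sketch of Eq. 5-7 (single-head self-attention). The class name `SelfAttention` and the fused `to_qkv` projection are illustrative choices, not the paper's reference code.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention, mirroring Eq. 5-7 (illustrative sketch)."""
    def __init__(self, dim_D, dim_Dh):
        super().__init__()
        # U_qkv in R^{D x 3*D_h}: one fused projection producing q, k, v
        self.to_qkv = nn.Linear(dim_D, 3 * dim_Dh, bias=False)
        self.scale = dim_Dh ** -0.5

    def forward(self, z):                                   # z: (batch, N, D)
        q, k, v = self.to_qkv(z).chunk(3, dim=-1)           # each: (batch, N, D_h), Eq. 5
        A = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, N), Eq. 6
        return A @ v                                        # (batch, N, D_h), Eq. 7
```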
- MSA runs $k$ self-attention operations (heads) in parallel and projects their concatenated outputs
- $D_h$ is set to $D/k$ so that the number of parameters stays constant when $k$ changes
$$
\begin{aligned}
\text{MSA}(\mathbf{z}) &= [\text{SA}_1(\mathbf{z}); \text{SA}_2(\mathbf{z}); ...; \text{SA}_k(\mathbf{z})]\,\mathbf{U}_{msa} &\mathbf{U}_{msa} &\in \mathbb{R}^{k \cdot D_h \times D} &(8)
\end{aligned}
$$
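Continuing the sketch (reusing `SelfAttention` from above), Eq. 8 can be written head-by-head; production implementations usually fuse all heads into one batched matmul, but this version mirrors the equation directly.

```python
class MSA(nn.Module):
    """Multi-head self-attention as in Eq. 8, head-by-head for clarity."""
    def __init__(self, dim_D, num_heads):
        super().__init__()
        dim_Dh = dim_D // num_heads                       # D_h = D / k keeps the parameter count constant
        self.heads = nn.ModuleList([SelfAttention(dim_D, dim_Dh) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * dim_Dh, dim_D)  # U_msa in R^{k*D_h x D}

    def forward(self, z):                                 # z: (batch, N, D)
        return self.proj(torch.cat([head(z) for head in self.heads], dim=-1))
```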
- MLP blocks (Eq. 2, 3): two layers with a GELU non-linearity
- Layernorm (LN) is applied before every block
- Residual connections are applied after every block
$$
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_\text{class}; \mathbf{x}^1_p\mathbf{E}; \mathbf{x}^2_p\mathbf{E}; ...; \mathbf{x}^N_p\mathbf{E}] + \mathbf{E}_{pos}, &&\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} &&(1) \\
\mathbf{z}'_l &= \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}, &&l=1...L &&(2) \\
\mathbf{z}_l &= \text{MLP}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l, &&l=1...L &&(3) \\
\mathbf{y} &= \text{LN}(\mathbf{z}^0_L) &&&&(4)
\end{aligned}
$$
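Putting Eq. 1-4 together, here is a simplified end-to-end encoder sketch reusing the `MSA` module above. The hyperparameter defaults, the patch extraction via `unfold`, and the omission of dropout and initialization details are assumptions made for brevity, not the reference implementation.

```python
class ViTEncoder(nn.Module):
    """Patch embedding + pre-norm Transformer encoder (Eq. 1-4), simplified."""
    def __init__(self, image_size=224, patch_size=16, channels=3,
                 dim_D=768, depth_L=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2              # N = HW / P^2
        patch_dim = channels * patch_size ** 2                     # P^2 * C
        self.patch_size = patch_size
        self.patch_embed = nn.Linear(patch_dim, dim_D)             # E in R^{(P^2*C) x D}
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim_D))    # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim_D))  # E_pos
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "ln1": nn.LayerNorm(dim_D),
                "msa": MSA(dim_D, num_heads),
                "ln2": nn.LayerNorm(dim_D),
                "mlp": nn.Sequential(nn.Linear(dim_D, mlp_dim), nn.GELU(),
                                     nn.Linear(mlp_dim, dim_D)),
            }) for _ in range(depth_L)])
        self.ln_out = nn.LayerNorm(dim_D)

    def forward(self, x):                                          # x: (batch, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Reshape the image into a sequence of flattened P x P patches: (B, N, P^2*C)
        patches = x.unfold(2, P, P).unfold(3, P, P)                # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.patch_embed(patches)                              # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed            # Eq. 1
        for blk in self.layers:
            z = blk["msa"](blk["ln1"](z)) + z                      # Eq. 2
            z = blk["mlp"](blk["ln2"](z)) + z                      # Eq. 3
        return self.ln_out(z[:, 0])                                # y = LN(z_L^0), Eq. 4
```

For example, `ViTEncoder()(torch.randn(2, 3, 224, 224))` returns a `(2, 768)` image representation to which a classification head can be attached.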
Hybrid Architecture
- The input sequence can alternatively be formed from the feature maps of a CNN
- The patch embedding projection $\mathbf{E}$ (Eq. 1) is then applied to patches extracted from the CNN feature map
- As a special case, the patches can have spatial size 1x1, i.e. the feature map is simply flattened and projected to the Transformer dimension
- The classification input embedding and position embeddings are added as before (see the sketch below)
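A hedged sketch of this hybrid input pipeline, using a truncated torchvision ResNet-50 purely as an example backbone; the truncation point and the 768-dimensional projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

# Example backbone: ResNet-50 cut after its third stage (illustrative choice, recent torchvision API).
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-3])

feat = backbone(torch.randn(2, 3, 224, 224))   # CNN feature map: (B, C_feat, H', W') = (2, 1024, 14, 14)
B, C_feat, Hp, Wp = feat.shape
tokens = feat.flatten(2).transpose(1, 2)       # 1x1 "patches": (B, H'*W', C_feat)
embed = nn.Linear(C_feat, 768)                 # patch embedding projection E applied to CNN features
z = embed(tokens)                              # then prepend the class token and add position embeddings as before
```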
3.2 Fine-tuning and Higher Resolution
- ViT is pre-trained on large datasets and fine-tuned on (smaller) downstream tasks
- For fine-tuning, the pre-trained prediction head is removed and a zero-initialized $D \times K$ feedforward layer is attached
- $K$ : the number of downstream classes
- It is often beneficial to fine-tune at a higher resolution than used for pre-training (cf. Fixing the train-test resolution discrepancy; BiT)
- When the resolution increases but the patch size is kept fixed, the effective sequence length grows
- ViT can handle arbitrary sequence lengths, but the pre-trained position embeddings may no longer be meaningful
- Therefore the pre-trained position embeddings are 2D-interpolated according to their location in the original image (sketched below)
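A sketch of this resizing step, assuming a square patch grid and keeping the class-token embedding unchanged; the function name `resize_pos_embed` and the bicubic interpolation mode are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid_size):
    """2D-interpolate pre-trained position embeddings to a new patch grid.

    pos_embed: (1, 1 + N_old, D) with the class token at index 0.
    new_grid_size: side length of the new (square) patch grid.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]   # keep the class-token embedding as-is
    old_grid = int(patch_pe.shape[1] ** 0.5)
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)   # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid_size, new_grid_size),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. fine-tuning at 384x384 with 16x16 patches: a 24x24 grid instead of 14x14
new_pos = resize_pos_embed(torch.zeros(1, 1 + 14 * 14, 768), 24)   # -> (1, 1 + 576, 768)
```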
4. Experiments
4.1 Setup
4.2 Comparison to State of the Art
4.3 Pre-training Data Requirements
4.4 Scaling Study