1. Introduction
2. Related Work
3. Method
3.1 Vision Transformer (ViT)

- The standard Transformer takes a 1D sequence of token embeddings as input
- The 2D image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ is therefore reshaped into a sequence of flattened 2D patches $\mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
- $(H, W)$ : the resolution of the original image
- $C$ : the number of channels
- $(P, P)$ : the resolution of each image patch
- $N = HW/P^2$ : the resulting number of patches
- The Transformer uses a constant latent vector size $D$ through all of its layers
- Each patch is flattened and mapped to $D$ dimensions with a trainable linear projection (Eq. 1)
→ the output of this projection is the patch embedding
- Similar to BERT's $\text{[class]}$ token, a learnable embedding is prepended to the sequence of embedded patches ($\mathbf{z}^0_0 = \mathbf{x}_\text{class}$)
- Its state at the output of the Transformer encoder ($\mathbf{z}^0_L$) serves as the image representation $\mathbf{y}$ (Eq. 4)
- During both pre-training and fine-tuning, a classification head is attached to $\mathbf{z}^0_L$
- The classification head is an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning
- Position embeddings are added to the patch embeddings to retain positional information
- Standard learnable 1D position embeddings are used (2D-aware embeddings showed no significant gains, Appendix D.3)
- The Transformer encoder consists of alternating MSA (Multi-head Self-Attention) and MLP blocks
- MSA (Multi-head Self-Attention) (Appendix A)
- For each element, compute a weighted sum over all values $\mathbf{v}$ in the sequence
- The attention weights $A_{ij}$ are based on the pairwise similarity between $\mathbf{q}^i$ and $\mathbf{k}^j$ (SA: Self-Attention)
$$
\begin{aligned}
[\mathbf{q}, \mathbf{k}, \mathbf{v}] &= \mathbf{z}\mathbf{U}_{qkv} &\mathbf{U}_{qkv} &\in \mathbb{R}^{D \times 3D_h}, &(5) \\
A &= \text{softmax}(\mathbf{q}\mathbf{k}^\text{T}/\sqrt{D_h}) &A &\in \mathbb{R}^{N \times N}, &(6) \\
\text{SA}(\mathbf{z}) &= A\mathbf{v}. &&&(7)
\end{aligned}
$$
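Below is a minimal PyTorch sketch of Eq. 5-7 (single-head self-attention). The class name `SelfAttention` and the fused `to_qkv` projection are illustrative choices, not the paper's reference code.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention, mirroring Eq. 5-7 (illustrative sketch)."""
    def __init__(self, dim_D, dim_Dh):
        super().__init__()
        # U_qkv in R^{D x 3*D_h}: one fused projection producing q, k, v
        self.to_qkv = nn.Linear(dim_D, 3 * dim_Dh, bias=False)
        self.scale = dim_Dh ** -0.5

    def forward(self, z):                                   # z: (batch, N, D)
        q, k, v = self.to_qkv(z).chunk(3, dim=-1)           # each: (batch, N, D_h), Eq. 5
        A = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, N), Eq. 6
        return A @ v                                        # (batch, N, D_h), Eq. 7
```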
- MSA runs $k$ self-attention operations (heads) in parallel and projects their concatenated outputs
- $D_h$ is set to $D/k$ so that the number of parameters stays constant when $k$ changes
$$
\begin{aligned}
\text{MSA}(\mathbf{z}) &= [\text{SA}_1(\mathbf{z}); \text{SA}_2(\mathbf{z}); ...; \text{SA}_k(\mathbf{z})]\,\mathbf{U}_{msa} &\mathbf{U}_{msa} &\in \mathbb{R}^{k \cdot D_h \times D} &(8)
\end{aligned}
$$
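Continuing the sketch (reusing `SelfAttention` from above), Eq. 8 can be written head-by-head; production implementations usually fuse all heads into one batched matmul, but this version mirrors the equation directly.

```python
class MSA(nn.Module):
    """Multi-head self-attention as in Eq. 8, head-by-head for clarity."""
    def __init__(self, dim_D, num_heads):
        super().__init__()
        dim_Dh = dim_D // num_heads                       # D_h = D / k keeps the parameter count constant
        self.heads = nn.ModuleList([SelfAttention(dim_D, dim_Dh) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * dim_Dh, dim_D)  # U_msa in R^{k*D_h x D}

    def forward(self, z):                                 # z: (batch, N, D)
        return self.proj(torch.cat([head(z) for head in self.heads], dim=-1))
```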
- MLP blocks (Eq. 2, 3): two layers with a GELU non-linearity
- Layernorm (LN) is applied before every block
- Residual connections are applied after every block
$$
\begin{aligned}
\mathbf{z}_0 &= [\mathbf{x}_\text{class}; \mathbf{x}^1_p\mathbf{E}; \mathbf{x}^2_p\mathbf{E}; ...; \mathbf{x}^N_p\mathbf{E}] + \mathbf{E}_{pos}, &&\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} &&(1) \\
\mathbf{z}'_l &= \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}, &&l=1...L &&(2) \\
\mathbf{z}_l &= \text{MLP}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l, &&l=1...L &&(3) \\
\mathbf{y} &= \text{LN}(\mathbf{z}^0_L) &&&&(4)
\end{aligned}
$$
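Putting Eq. 1-4 together, here is a simplified end-to-end encoder sketch reusing the `MSA` module above. The hyperparameter defaults, the patch extraction via `unfold`, and the omission of dropout and initialization details are assumptions made for brevity, not the reference implementation.

```python
class ViTEncoder(nn.Module):
    """Patch embedding + pre-norm Transformer encoder (Eq. 1-4), simplified."""
    def __init__(self, image_size=224, patch_size=16, channels=3,
                 dim_D=768, depth_L=12, num_heads=12, mlp_dim=3072):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2              # N = HW / P^2
        patch_dim = channels * patch_size ** 2                     # P^2 * C
        self.patch_size = patch_size
        self.patch_embed = nn.Linear(patch_dim, dim_D)             # E in R^{(P^2*C) x D}
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim_D))    # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim_D))  # E_pos
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "ln1": nn.LayerNorm(dim_D),
                "msa": MSA(dim_D, num_heads),
                "ln2": nn.LayerNorm(dim_D),
                "mlp": nn.Sequential(nn.Linear(dim_D, mlp_dim), nn.GELU(),
                                     nn.Linear(mlp_dim, dim_D)),
            }) for _ in range(depth_L)])
        self.ln_out = nn.LayerNorm(dim_D)

    def forward(self, x):                                          # x: (batch, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Reshape the image into a sequence of flattened P x P patches: (B, N, P^2*C)
        patches = x.unfold(2, P, P).unfold(3, P, P)                # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.patch_embed(patches)                              # (B, N, D)
        cls = self.cls_token.expand(B, -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed            # Eq. 1
        for blk in self.layers:
            z = blk["msa"](blk["ln1"](z)) + z                      # Eq. 2
            z = blk["mlp"](blk["ln2"](z)) + z                      # Eq. 3
        return self.ln_out(z[:, 0])                                # y = LN(z_L^0), Eq. 4
```

For example, `ViTEncoder()(torch.randn(2, 3, 224, 224))` returns a `(2, 768)` image representation to which a classification head can be attached.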
Hybrid Architecture
- The input sequence can alternatively be formed from the feature maps of a CNN
- The patch embedding projection $\mathbf{E}$ (Eq. 1) is then applied to patches extracted from the CNN feature map
- As a special case, the patches can have spatial size 1x1, i.e. the feature map is simply flattened and projected to the Transformer dimension
- The classification input embedding and position embeddings are added as before (see the sketch below)
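A hedged sketch of this hybrid input pipeline, using a truncated torchvision ResNet-50 purely as an example backbone; the truncation point and the 768-dimensional projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

# Example backbone: ResNet-50 cut after its third stage (illustrative choice, recent torchvision API).
backbone = nn.Sequential(*list(torchvision.models.resnet50(weights=None).children())[:-3])

feat = backbone(torch.randn(2, 3, 224, 224))   # CNN feature map: (B, C_feat, H', W') = (2, 1024, 14, 14)
B, C_feat, Hp, Wp = feat.shape
tokens = feat.flatten(2).transpose(1, 2)       # 1x1 "patches": (B, H'*W', C_feat)
embed = nn.Linear(C_feat, 768)                 # patch embedding projection E applied to CNN features
z = embed(tokens)                              # then prepend the class token and add position embeddings as before
```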
3.2 Fine-tuning and Higher Resolution
- ViT is pre-trained on large datasets and fine-tuned on (smaller) downstream tasks
- For fine-tuning, the pre-trained prediction head is removed and a zero-initialized $D \times K$ feedforward layer is attached
- $K$ : the number of downstream classes
- It is often beneficial to fine-tune at a higher resolution than used for pre-training (cf. Fixing the train-test resolution discrepancy; BiT)
- When the resolution increases but the patch size is kept fixed, the effective sequence length grows
- ViT can handle arbitrary sequence lengths, but the pre-trained position embeddings may no longer be meaningful
- Therefore the pre-trained position embeddings are 2D-interpolated according to their location in the original image (sketched below)
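A sketch of this resizing step, assuming a square patch grid and keeping the class-token embedding unchanged; the function name `resize_pos_embed` and the bicubic interpolation mode are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid_size):
    """2D-interpolate pre-trained position embeddings to a new patch grid.

    pos_embed: (1, 1 + N_old, D) with the class token at index 0.
    new_grid_size: side length of the new (square) patch grid.
    """
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]   # keep the class-token embedding as-is
    old_grid = int(patch_pe.shape[1] ** 0.5)
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)   # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid_size, new_grid_size),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid_size ** 2, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# e.g. fine-tuning at 384x384 with 16x16 patches: a 24x24 grid instead of 14x14
new_pos = resize_pos_embed(torch.zeros(1, 1 + 14 * 14, 768), 24)   # -> (1, 1 + 576, 768)
```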
4. Experiments
4.1 Setup
4.2 Comparison to State of the Art
4.3 Pre-training Data Requirements
4.4 Scaling Study