1. Introduction

2. Related Work

3. Method

3.1 Vision Transformer (ViT)

$$ \begin{aligned} \mathbf{z}_0 &= [\mathbf{x}_\text{class}; \, \mathbf{x}^1_p\mathbf{E}; \, \mathbf{x}^2_p\mathbf{E}; \, \cdots; \, \mathbf{x}^N_p\mathbf{E}] + \mathbf{E}_{pos}, &&\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D},\ \mathbf{E}_{pos} \in \mathbb{R}^{(N+1) \times D} &&(1) \\
\mathbf{z}'_l &= \text{MSA}(\text{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1}, &&l=1 \ldots L &&(2) \\
\mathbf{z}_l &= \text{MLP}(\text{LN}(\mathbf{z}'_l)) + \mathbf{z}'_l, &&l=1 \ldots L &&(3) \\
\mathbf{y} &= \text{LN}(\mathbf{z}^0_L) &&&&(4)
\end{aligned} $$
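Equations (1)–(4) can be traced end to end in a minimal NumPy sketch. This is an illustrative forward pass only, not the reference implementation: for brevity it uses single-head attention in place of MSA, omits learned LN scale/shift and dropout, and all sizes (`N`, `D`, `L`, etc.) are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    # LN over the feature dimension (learned scale/shift omitted for brevity)
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(z, Wq, Wk, Wv, Wo):
    # self-attention standing in for MSA in Eq. (2); single head for simplicity
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (a @ v) @ Wo

def mlp(z, W1, b1, W2, b2):
    # two-layer MLP with a tanh-approximated GELU, as in Eq. (3)
    h = z @ W1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2 + b2

# Toy sizes: patch side P, channels C, width D, N patches, L layers
P, C, D, N, L = 4, 3, 8, 9, 2
x_p = rng.normal(size=(N, P * P * C))          # flattened patches x_p^i
E = rng.normal(size=(P * P * C, D)) * 0.02     # patch embedding E, Eq. (1)
x_class = rng.normal(size=(1, D))              # learnable [class] token
E_pos = rng.normal(size=(N + 1, D)) * 0.02     # position embeddings E_pos

# Eq. (1): project patches, prepend class token, add position embeddings
z = np.concatenate([x_class, x_p @ E], axis=0) + E_pos

for _ in range(L):
    Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.02 for _ in range(4))
    W1, b1 = rng.normal(size=(D, 4 * D)) * 0.02, np.zeros(4 * D)
    W2, b2 = rng.normal(size=(4 * D, D)) * 0.02, np.zeros(D)
    z = msa(layer_norm(z), Wq, Wk, Wv, Wo) + z  # Eq. (2): pre-LN attention block
    z = mlp(layer_norm(z), W1, b1, W2, b2) + z  # Eq. (3): pre-LN MLP block

y = layer_norm(z[0])  # Eq. (4): image representation from the class token
```

Note that LayerNorm is applied *before* each sub-block and the residual connection adds the un-normalized input, matching the pre-norm form of Eqs. (2)–(3); the final representation `y` is read from position 0, the class token.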

Hybrid Architecture

3.2 Fine-tuning and Higher Resolution

4. Experiments

4.1 Setup

4.2 Comparison to State of the Art

4.3 Pre-training Data Requirements

4.4 Scaling Study