Convolutional neural networks
Vision Transformer (ViT)
Data-efficient image Transformers (DeiT)
Contributions
The attention mechanism matches a query vector against a set of (key, value) vector pairs
$$ \text{Attention}(Q, K, V) = \text{Softmax}(QK^\text{T}/\sqrt{d})V, \tag{1} $$
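A minimal NumPy sketch of Eq. (1), scaled dot-product attention; the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d) queries, K: (m, d) keys, V: (m, dv) values.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (n, m) scaled dot products
    return softmax(scores) @ V         # each output row is a convex combination of V's rows
```

Since the softmax weights in each row sum to 1, every output row is a weighted average of the value vectors.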
Self-attention layer
Multi-head self-attention layer (MSA)
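MSA runs several attention "heads" in parallel on different learned projections of the same input, then concatenates the results and applies an output projection. A self-contained sketch, assuming the common design where the model dimension d is split evenly across h heads (the weight names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    # X: (n, d) token embeddings; Wq, Wk, Wv, Wo: (d, d) projection matrices; h: number of heads.
    n, d = X.shape
    dh = d // h                              # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # (n, d) each
    outs = []
    for i in range(h):
        s = slice(i * dh, (i + 1) * dh)      # columns belonging to head i
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        outs.append(softmax(scores) @ V[:, s])
    return np.concatenate(outs, axis=-1) @ Wo  # (n, d) after the output projection
```

In self-attention the queries, keys, and values all come from the same sequence X, which is what distinguishes this layer from general (cross-)attention.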