Summary

Cited some 30,000 times, this paper can fairly be called revolutionary. It was the first to propose the Transformer, a model built solely on the attention mechanism. The Transformer was first used in NLP, and papers applying it have since sprung up across many fields, including DETR, ViT (An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale), Point Transformer, and Perceiver (Perceiver: General Perception with Iterative Attention), some of which have taken SOTA.

1. Introduction

RNNs, LSTMs, and GRUs have held SOTA in sequence modeling and transduction problems such as language modeling and machine translation. There have also been ongoing efforts to push the boundaries of recurrent language models and encoder-decoder architectures.

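A recurrent model builds its hidden states one at a time, computing $h_t$ from the previous hidden state $h_{t-1}$ and the input at position $t$. This sequential nature prevents parallelization within a training example. Later work improved computational efficiency, and in some cases model performance as well, but the fundamental constraint of sequential computation remains.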
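To make the sequential dependency concrete, here is a minimal sketch of a recurrence of this form. It is a scalar toy, not the paper's model: the `tanh` nonlinearity, the function name `rnn_hidden_states`, and the weights `w_h` and `w_x` are all assumptions chosen for illustration.

```python
import math

def rnn_hidden_states(inputs, w_h, w_x, h0=0.0):
    """Toy scalar recurrence (assumed form): h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    states = []
    h_prev = h0
    for x_t in inputs:
        # h_t needs h_{t-1}, so step t cannot start before step t-1 finishes.
        h_t = math.tanh(w_h * h_prev + w_x * x_t)
        states.append(h_t)
        h_prev = h_t
    return states

# Hypothetical inputs and weights, for illustration only.
states = rnn_hidden_states([1.0, 0.5, -0.3], w_h=0.8, w_x=1.2)
print(states)
```

Because each iteration reads the result of the previous one, the loop over positions cannot be split across parallel workers. Changing the first input changes every later hidden state.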
