📄 Paper Overview

Title: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Authors: Angelos Katharopoulos et al. (Idiap Research Institute, EPFL)

Published at: ICML 2020 (International Conference on Machine Learning)


1. Introduction

1.1 The Core Problem

The Transformer's chronic problem: self-attention compares every query token against every key token, so its time and memory cost grows quadratically, O(N²), with the sequence length N.

An intuitive analogy: with 10 students in a classroom, checking the relationship between every pair of students takes 45 comparisons, but with 100 students? 4,950! 😱

1.2 The Paper's Key Idea

The authors put it this way:

"์šฐ๋ฆฌ๋Š” self-attention์„ kernel feature map์˜ ์„ ํ˜• ๋‚ด์ ์œผ๋กœ ํ‘œํ˜„ํ•˜๊ณ , ํ–‰๋ ฌ ๊ณฑ์…ˆ์˜ ๊ฒฐํ•ฉ๋ฒ•์น™(associativity)์„ ์ด์šฉํ•ด์„œ ๋ณต์žก๋„๋ฅผ O(Nยฒ)์—์„œ **O(N)**์œผ๋กœ ์ค„์˜€๋‹ค!"

๋” ๋†€๋ผ์šด ๋ฐœ๊ฒฌ:


2. Related Work