Sparse Transformer

Generating Long Sequences with Sparse Transformers

논문이 하려고 하는 이야기

기존 Attention Matrix -> O(n^2)

새로운 Attention (Sparse) -> O(n루트n)

=> 합쳐서 Sparse Transformer!

(a)가 기존의 Transformer

(b)는 이미지/음악파일 같이 특정한 길이가 의미를 갖는 경우 해당 길이/주기만큼을 Attention

(c)는 Text처럼 특정한 길이 의미 없이 정해진 시퀀스 길이에 따른 Attention 결정

위쪽 이미지는 6*6 사이즈의 "이미지"

아래쪽은 Connectivity Matrix -> 실제로 펼치면 어떤 Attention을 취하는지 보여주는 셈 (어텐션 패턴)

이 논문에서 연구 방향은 Sparse Attention Pattern 자체에만 집중함!