Evolution of Attention Mechanisms | Notion

<aside> 💡

MHA → MQA → GQA → MLA → DSA

</aside>

Multi Head Attention

Each query gets its own KV pair. Eg - in 8 heads, for 8 query heads we have 8 key heads and 8 value heads.
max expressiveness
memory cost is enormous
slow but high quality

Multi Query Attention

All query heads get a single KV head. 8 query heads, 1 key value head.
quality degrades
fast but unstable

middle ground → GQA

Grouped Query Attention

Create groups of query heads and each group shares one KV head. eg - 8 query heads grouped into 4 groups, having one KV head each, so 4 KV heads.
Intuition - queries in a group attend to similar patterns - similar attention = share KV
GQ8 - industry sweet spot - group into 8; 8 KV heads
made llama3 5x faster without sacrificing quality - MHA → MQA → GQA