<aside>
💡
MHA → MQA → GQA → MLA → DSA
</aside>
Multi Head Attention
- Each query gets its own KV pair. Eg - in 8 heads, for 8 query heads we have 8 key heads and 8 value heads.
- max expressiveness
- memory cost is enormous
- slow but high quality
Multi Query Attention
- All query heads get a single KV head. 8 query heads, 1 key value head.
- quality degrades
- fast but unstable
middle ground → GQA
Grouped Query Attention
- Create groups of query heads and each group shares one KV head. eg - 8 query heads grouped into 4 groups, having one KV head each, so 4 KV heads.
- Intuition - queries in a group attend to similar patterns - similar attention = share KV
- GQ8 - industry sweet spot - group into 8; 8 KV heads
- made llama3 5x faster without sacrificing quality - MHA → MQA → GQA