Last Updated: December 10, 2025

[Image: "The GOAT"]

Multi-Head Attention (MHA)

This is the original QKV attention mechanism from the “Attention Is All You Need” paper (link). In each attention block, multiple independent attention heads process inter-token relationships, and their outputs are concatenated and projected to form the block's output.

Math

$$ Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} $$

The following is computed per head, and the outputs of all heads are concatenated and combined through the output projection $W^{O}$:

$$ \text{attention\_score}_i = \text{softmax}\!\left(\frac{Q_iK_i^{T}}{\sqrt{d_k}}\right) = \text{softmax}\!\left(\frac{XW_i^{Q}{W_i^{K}}^{T}X^{T}}{\sqrt{d_k}}\right) $$

$$ O = \text{Concat}_i\!\left(\text{attention\_score}_i \, V_i\right)W^{O} = \text{Concat}_i\!\left(\text{softmax}\!\left(\frac{XW_i^{Q}{W_i^{K}}^{T}X^{T}}{\sqrt{d_k}}\right)XW_i^{V}\right)W^{O} $$

Code

[TBD]
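In the meantime, a minimal PyTorch sketch of an MHA block; the module structure, shapes, and names below are illustrative assumptions, not taken from the paper.

```python
# Minimal MHA sketch (illustrative; names/shapes are assumptions, not from the paper).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # W^Q, W^K, W^V, W^O implemented as single linear layers over all heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        b, t, _ = x.shape
        # Project and split into heads: [batch, num_heads, seq_len, d_k]
        q = self.w_q(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = F.softmax(scores, dim=-1) @ v          # [batch, num_heads, seq_len, d_k]
        # Concatenate heads and apply the output projection W^O
        out = out.transpose(1, 2).reshape(b, t, -1)  # [batch, seq_len, d_model]
        return self.w_o(out)


# Usage
x = torch.randn(2, 8, 64)
mha = MultiHeadAttention(d_model=64, num_heads=4)
print(mha(x).shape)  # torch.Size([2, 8, 64])
```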

Grouped-Query Attention (GQA)

One prominent issue with MHA is that incremental (autoregressive) inference is often slow due to the memory-bandwidth cost of repeatedly loading the large “keys” and “values” tensors. So Noam Shazeer introduced Multi-Query Attention (MQA) in this paper (link), which keeps only a single key and value head that is shared by all query heads. Yet, due to the loss of expressivity, MQA suffers performance degradation compared to MHA.
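To make the memory-bandwidth point concrete, here is a back-of-the-envelope per-token KV-cache comparison; the layer count, head count, head dimension, and fp16 precision are assumed purely for illustration.

```python
# Assumed model dims: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes/element).
num_layers, num_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

def kv_cache_bytes_per_token(num_kv_heads: int) -> int:
    # 2x for keys and values, summed over all layers
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes_per_token(num_kv_heads=num_heads))  # MHA: 524288 bytes (~512 KiB)
print(kv_cache_bytes_per_token(num_kv_heads=1))          # MQA:  16384 bytes (~16 KiB)
```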

GQA takes the natural middle ground: instead of keeping a single key & value head, we keep multiple (but fewer than the number of query heads) key and value heads, each of which is attended by a group of query heads.
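A quick sketch of the grouping step, assuming the query/key/value projections have already been applied; setting num_kv_heads = 1 recovers MQA and num_kv_heads = num_heads recovers MHA.

```python
# GQA grouping sketch (assumed shapes; not the paper's reference implementation).
import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: [batch, num_heads, seq, d_k]; k, v: [batch, num_kv_heads, seq, d_k]
    group = q.shape[1] // k.shape[1]
    # Each key/value head is shared by `group` query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v  # [batch, num_heads, seq, d_k]

# 8 query heads share 2 key/value heads (groups of 4)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```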


Math