Last Updated: December 10, 2025

The GOAT
This is the very first implementation of the QKV attention mechanism from the original “Attention Is All You Need” paper (link). In each attention block, multiple independent attention heads process inter-token relationships, and their outputs are eventually projected to the block's output.
$$ Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} $$
The following is computed per head, and the outputs from all heads are combined through the output projection $W^{O}$:
$$ \text{attention\_score}_i = \text{softmax}\left(\frac{Q_iK_i^{T}}{\sqrt{d_k}}\right) = \text{softmax}\left(\frac{XW_i^{Q}{W_i^{K}}^{T}X^{T}}{\sqrt{d_k}}\right)\\
O = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)VW^{O} = \text{softmax}\left(\frac{XW^{Q}{W^{K}}^{T}X^{T}}{\sqrt{d_k}}\right)XW^{V}W^{O} $$
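To make the shapes concrete, here is a minimal PyTorch sketch of MHA. It is not the paper's reference implementation; the class and parameter names (`MultiHeadAttention`, `d_model`, `n_heads`) are illustrative, and masking and dropout are omitted for brevity.

```python
# Minimal multi-head attention sketch (illustrative, not the paper's reference code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # W^Q, W^K, W^V, W^O, each covering all heads at once
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Project and split into heads: (b, n_heads, t, d_head)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                                    # (b, n_heads, t, d_head)
        # Concatenate heads and apply the output projection W^O
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.w_o(out)
```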
[TBD]
One prominent issue with MHA is that incremental inference is often slow due to the memory-bandwidth cost of repeatedly loading the large “keys” and “values” tensors. So, Noam Shazeer introduced Multi-Query Attention (MQA) in this paper (link), where we extract only one key and one value tensor per token, attended by multiple query tensors. Yet, due to the loss of expressivity, MQA suffers performance degradation compared to MHA.
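For illustration, a minimal sketch of MQA under the same assumptions as the MHA sketch above: only the key/value projections change, producing a single K/V head of size `d_head` that is broadcast across all query heads. During incremental decoding, this shrinks the per-token KV cache from `n_heads * d_head` to `d_head` values per projection.

```python
# Minimal multi-query attention sketch (illustrative, not Shazeer's reference code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # Single key/value head: projections map d_model -> d_head
        self.w_k = nn.Linear(d_model, self.d_head, bias=False)
        self.w_v = nn.Linear(d_model, self.d_head, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # K and V are (b, 1, t, d_head); the size-1 head dim broadcasts over all query heads
        k = self.w_k(x).unsqueeze(1)
        v = self.w_v(x).unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.w_o(out)
```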
A natural middle ground, known as Grouped-Query Attention (GQA), is that instead of keeping a single key & value tensor, we extract multiple but fewer keys and values, each of which is attended by a group of query tensors.
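A hedged sketch of this grouped variant, again under the same illustrative setup: an assumed `n_kv_heads` parameter controls how many key/value heads are kept, and setting `n_kv_heads = 1` recovers MQA while `n_kv_heads = n_heads` recovers MHA.

```python
# Minimal grouped-query attention sketch (illustrative; n_kv_heads must divide n_heads).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        # Fewer key/value heads than query heads
        self.w_k = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.w_v = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Repeat each K/V head so every query head in its group attends to the same K/V
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.w_o(out)
```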
