Last Updated: December 10, 2025

[Image: "The GOAT"]

Multi-Head Attention (MHA)

This is the original QKV attention mechanism from the “Attention Is All You Need” paper (link). In each attention block, multiple independent attention heads process inter-token relationships, and their outputs are concatenated and projected to form the block's output.

Math

$$ Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} $$

The following is computed per head, and the outputs of all heads are concatenated and combined through the output projection $W^{O}$:

$$ \text{attention\_score}_i = \text{softmax}\!\left(\frac{Q_iK_i^{T}}{\sqrt{d_k}}\right) = \text{softmax}\!\left(\frac{XW_i^{Q}{W_i^{K}}^{T}X^{T}}{\sqrt{d_k}}\right) $$

$$ O = \text{Concat}_i\!\left(\text{attention\_score}_i \, V_i\right)W^{O} = \text{Concat}_i\!\left(\text{softmax}\!\left(\frac{XW_i^{Q}{W_i^{K}}^{T}X^{T}}{\sqrt{d_k}}\right)XW_i^{V}\right)W^{O} $$

Code

[TBD]
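In the meantime, a minimal PyTorch sketch of an MHA block; the module structure, shapes, and names below are illustrative assumptions, not taken from the paper.

```python
# Minimal MHA sketch (illustrative; names/shapes are assumptions, not from the paper).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # W^Q, W^K, W^V, W^O implemented as single linear layers over all heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        b, t, _ = x.shape
        # Project and split into heads: [batch, num_heads, seq_len, d_k]
        q = self.w_q(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = F.softmax(scores, dim=-1) @ v          # [batch, num_heads, seq_len, d_k]
        # Concatenate heads and apply the output projection W^O
        out = out.transpose(1, 2).reshape(b, t, -1)  # [batch, seq_len, d_model]
        return self.w_o(out)


# Usage
x = torch.randn(2, 8, 64)
mha = MultiHeadAttention(d_model=64, num_heads=4)
print(mha(x).shape)  # torch.Size([2, 8, 64])
```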

Grouped-Query Attention (GQA)

One prominent issue with MHA is that incremental (autoregressive) inference is often slow due to the memory-bandwidth cost of repeatedly loading the large “keys” and “values” tensors. So Noam Shazeer introduced Multi-Query Attention (MQA) in this paper (link), which keeps only a single key and value head that is shared by all query heads. Yet, due to the loss of expressivity, MQA suffers performance degradation compared to MHA.
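To make the memory-bandwidth point concrete, here is a back-of-the-envelope per-token KV-cache comparison; the layer count, head count, head dimension, and fp16 precision are assumed purely for illustration.

```python
# Assumed model dims: 32 layers, 32 heads, head_dim 128, fp16 (2 bytes/element).
num_layers, num_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

def kv_cache_bytes_per_token(num_kv_heads: int) -> int:
    # 2x for keys and values, summed over all layers
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

print(kv_cache_bytes_per_token(num_kv_heads=num_heads))  # MHA: 524288 bytes (~512 KiB)
print(kv_cache_bytes_per_token(num_kv_heads=1))          # MQA:  16384 bytes (~16 KiB)
```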

GQA takes the natural middle ground: instead of keeping a single key & value head, we keep multiple (but fewer than the number of query heads) key and value heads, each of which is attended by a group of query heads.
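A quick sketch of the grouping step, assuming the query/key/value projections have already been applied; setting num_kv_heads = 1 recovers MQA and num_kv_heads = num_heads recovers MHA.

```python
# GQA grouping sketch (assumed shapes; not the paper's reference implementation).
import math
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: [batch, num_heads, seq, d_k]; k, v: [batch, num_kv_heads, seq, d_k]
    group = q.shape[1] // k.shape[1]
    # Each key/value head is shared by `group` query heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v  # [batch, num_heads, seq, d_k]

# 8 query heads share 2 key/value heads (groups of 4)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```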


Math