Post-norm and pre-norm are two variants of the transformer architecture that differ in where layer normalization is applied within the transformer block. Post-norm (the Post-LN transformer), used in the original transformer, applies layer norm to the output of each residual block, i.e., after the residual addition. In contrast, pre-norm (the Pre-LN transformer) applies layer norm inside the residual branch, to the input of the attention or FFN sublayer. This distinction is shown in the figure below:

(a) Post-LN transformer block, and (b) Pre-LN transformer block. Source: On Layer Normalization in the Transformer Architecture.
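
To make the placement concrete, here is a minimal sketch of the two block variants in PyTorch. The class names, dimensions, and the GELU activation in the FFN are illustrative assumptions rather than a reference implementation; only the position of the LayerNorm calls matters.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-LN: residual addition first, then LayerNorm on the sum."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)      # LayerNorm applied to the residual output
        x = self.norm2(x + self.ffn(x))
        return x

class PreNormBlock(nn.Module):
    """Pre-LN: LayerNorm on the sublayer input, residual addition outside."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual path stays un-normalized
        x = x + self.ffn(self.norm2(x))
        return x
```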

Let’s also do a side-by-side comparison:

| | Post-norm | Pre-norm |
| --- | --- | --- |
| Definition | Puts LayerNorm between the residual blocks. | Puts LayerNorm inside the residual block, before the attention/FFN sublayer, and adds an extra final LayerNorm before prediction. |
| Order of operations | Attention/FFN → Residual → LayerNorm | LayerNorm → Attention/FFN → Residual |
| Forward pass of transformer block | `x = layerNorm_1(x + attn(x))`<br>`x = layerNorm_2(x + ffn(x))` | `x = x + attn(layerNorm_1(x))`<br>`x = x + ffn(layerNorm_2(x))` |
| Gradients at initialization | The expected gradients near the output layer are large, so a large learning rate makes training unstable; the warm-up stage helps avoid this problem. | Well-behaved at initialization. The warm-up stage can sometimes even be removed, which cuts a hyperparameter and eases tuning. |
| Learning-rate warm-up stage | Critical. Unlike many other architectures, the learning rate cannot simply start from a relatively large value and then decay (a sketch of the schedule follows this table). | Less critical; can sometimes be removed entirely. |
| Training time | Often longer because of the warm-up stage, where the learning rate has to ramp up gradually from zero. | Often shorter, since warm-up can be skipped and the loss decreases faster. |
| Performance | Slightly better | Slightly worse |
| Usage | Less common, despite being used in the original transformer. | Dominant in LLMs, because training stability matters more than small performance gains. |
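
For context, the warm-up schedule used with the original Post-LN transformer increases the learning rate linearly over the first `warmup_steps` steps and then decays it with the inverse square root of the step count. A minimal sketch follows; the defaults for `d_model` and `warmup_steps` are the values from the original paper, and the function name is illustrative.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks around step == warmup_steps and decays afterwards.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```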

Although post-norm was used in the original transformer, the field has since shifted to pre-norm: training stability trumps marginal performance gains, especially as models have become larger and deeper.
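
One rough way to see where the stability concern comes from is to stack many untrained blocks, run a single backward pass, and compare gradient norms in the block closest to the output. The sketch below reuses the hypothetical `PostNormBlock` and `PreNormBlock` classes from the earlier sketch; the depth, widths, and dummy loss are arbitrary choices for illustration only.

```python
import torch

def last_block_grad_norm(block_cls, depth=24, d_model=256, n_heads=4, d_ff=1024):
    """Sum of gradient norms in the last (output-side) block at initialization."""
    torch.manual_seed(0)
    blocks = torch.nn.ModuleList([block_cls(d_model, n_heads, d_ff) for _ in range(depth)])
    x = torch.randn(2, 16, d_model)   # (batch, sequence length, d_model)
    for block in blocks:
        x = block(x)
    x.pow(2).mean().backward()        # dummy scalar loss
    return sum(p.grad.norm().item() for p in blocks[-1].parameters())

print("post-norm:", last_block_grad_norm(PostNormBlock))
print("pre-norm: ", last_block_grad_norm(PreNormBlock))
```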

References