Post-norm and pre-norm are two variants of the transformer architecture that differ in where layer normalization is applied within the transformer block. Post-norm (the Post-LN transformer), used in the original transformer, applies layer norm to the output of each residual block, i.e., after the residual addition. In contrast, pre-norm (the Pre-LN transformer) applies layer norm inside the residual branch, to the input of the attention or FFN sublayer. This distinction is shown in the figure below:

(a) Post-LN transformer block, and (b) Pre-LN transformer block. Source: On Layer Normalization in the Transformer Architecture.
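
To make the placement concrete, here is a minimal sketch of the two block variants in PyTorch. The class names, dimensions, and the GELU activation in the FFN are illustrative assumptions rather than a reference implementation; only the position of the LayerNorm calls matters.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-LN: residual addition first, then LayerNorm on the sum."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)      # LayerNorm applied to the residual output
        x = self.norm2(x + self.ffn(x))
        return x

class PreNormBlock(nn.Module):
    """Pre-LN: LayerNorm on the sublayer input, residual addition outside."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                  # residual path stays un-normalized
        x = x + self.ffn(self.norm2(x))
        return x
```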

Let’s also do a side-by-side comparison:

| | Post-norm | Pre-norm |
| --- | --- | --- |
| Definition | Puts LayerNorm between the residual blocks. | Puts LayerNorm inside the residual block, before the attention/FFN sublayer, and adds an extra final LayerNorm before prediction. |
| Order of operations | Attention/FFN → Residual → LayerNorm | LayerNorm → Attention/FFN → Residual |
| Forward pass of transformer block | `x = layerNorm_1(x + attn(x))`<br>`x = layerNorm_2(x + ffn(x))` | `x = x + attn(layerNorm_1(x))`<br>`x = x + ffn(layerNorm_2(x))` |
| Gradients at initialization | The expected gradients near the output layer are large, so a large learning rate makes training unstable; the warm-up stage helps avoid this problem. | Well-behaved at initialization. The warm-up stage can sometimes even be removed, which cuts a hyperparameter and eases tuning. |
| Learning-rate warm-up stage | Critical. Unlike many other architectures, the learning rate cannot simply start from a relatively large value and then decay (a sketch of the schedule follows this table). | Less critical; can sometimes be removed entirely. |
| Training time | Often longer because of the warm-up stage, where the learning rate has to ramp up gradually from zero. | Often shorter, since warm-up can be skipped and the loss decreases faster. |
| Performance | Slightly better | Slightly worse |
| Usage | Less common, despite being used in the original transformer. | Dominant in LLMs, because training stability matters more than small performance gains. |
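
For context, the warm-up schedule used with the original Post-LN transformer increases the learning rate linearly over the first `warmup_steps` steps and then decays it with the inverse square root of the step count. A minimal sketch follows; the defaults for `d_model` and `warmup_steps` are the values from the original paper, and the function name is illustrative.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks around step == warmup_steps and decays afterwards.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```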

Although post-norm was used in the original transformer, the field has since shifted to pre-norm: training stability trumps marginal performance gains, especially as models have become larger and deeper.
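
One rough way to see where the stability concern comes from is to stack many untrained blocks, run a single backward pass, and compare gradient norms in the block closest to the output. The sketch below reuses the hypothetical `PostNormBlock` and `PreNormBlock` classes from the earlier sketch; the depth, widths, and dummy loss are arbitrary choices for illustration only.

```python
import torch

def last_block_grad_norm(block_cls, depth=24, d_model=256, n_heads=4, d_ff=1024):
    """Sum of gradient norms in the last (output-side) block at initialization."""
    torch.manual_seed(0)
    blocks = torch.nn.ModuleList([block_cls(d_model, n_heads, d_ff) for _ in range(depth)])
    x = torch.randn(2, 16, d_model)   # (batch, sequence length, d_model)
    for block in blocks:
        x = block(x)
    x.pow(2).mean().backward()        # dummy scalar loss
    return sum(p.grad.norm().item() for p in blocks[-1].parameters())

print("post-norm:", last_block_grad_norm(PostNormBlock))
print("pre-norm: ", last_block_grad_norm(PreNormBlock))
```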

References