Post-norm and pre-norm are two variants of the transformer architecture that differ in where layer normalization is applied within the transformer block. Post-norm (also called Post-LN), used in the original transformer, applies layer norm to the output of each residual block. In contrast, pre-norm (also called Pre-LN) applies layer norm inside the residual block, to the input of the attention or FFN sub-layer. This distinction is shown in the figure below and in the code sketch that follows it:
(a) Post-LN transformer block, and (b) Pre-LN transformer block. Source: On Layer Normalization in the Transformer Architecture.
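To make the figure concrete, here is a minimal PyTorch sketch of the two block variants. The hyperparameters (`d_model=512`, `n_heads=8`, `d_ff=2048`) and the choice of `nn.MultiheadAttention` with a ReLU feed-forward network are illustrative assumptions, not details of any particular model:

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied to the output of each residual block."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention/FFN -> Residual -> LayerNorm
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to each sub-layer's input, inside the residual."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm -> Attention/FFN -> Residual
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ffn(self.norm2(x))
        return x


x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```

The sub-layers are identical in both classes; the only difference is whether the LayerNorms sit outside the residual additions (post-norm) or inside them (pre-norm).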
Let’s also do a side-by-side comparison:

| | Post-norm | Pre-norm |
|---|---|---|
| Definition | Puts LayerNorm between the residual blocks. | Puts LayerNorm inside the residual block, before Attention/FFN. Also introduces an extra final layer norm before prediction. |
| Order of operations | Attention/FFN → Residual → LayerNorm | LayerNorm → Attention/FFN → Residual |
| Forward pass of transformer block | `x = layerNorm_1(x + attn(x))` <br> `x = layerNorm_2(x + ffn(x))` | `x = x + attn(layerNorm_1(x))` <br> `x = x + ffn(layerNorm_2(x))` |
| Gradients at initialization | The expected gradients near the output layer are large, so a large learning rate makes training unstable. The warm-up stage helps avoid this problem. | Gradients are well behaved at initialization, so the learning-rate warm-up stage can sometimes even be removed, which leaves fewer hyperparameters to tune. |
| Learning-rate warm-up stage | Critical, unlike many other architectures where the learning rate starts from a relatively large value and then decays (see the warm-up sketch after the table). | Less critical; can sometimes be removed entirely. |
| Training time | Often longer because of the warm-up stage, where the learning rate has to increase gradually from zero. | Often shorter, since there is no warm-up stage and the loss decays faster. |
| Performance | Slightly better | Slightly worse |
| Usage | Less common despite being used in the original transformer. | Dominant in LLMs because training stability matters more than small performance gains. |
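The warm-up stage mentioned above is simply a schedule that ramps the learning rate up from (near) zero before the usual decay begins. Below is a minimal sketch of linear warm-up in PyTorch; the peak learning rate of 1e-3 and the 4000 warm-up steps are illustrative assumptions, not recommended values:

```python
import torch

model = torch.nn.Linear(512, 512)        # stand-in for a transformer
peak_lr, warmup_steps = 1e-3, 4000
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)

# Scale the learning rate linearly from ~0 to peak_lr over warmup_steps,
# then hold it constant (a decay schedule would normally follow).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

for step in range(5):
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(f"step {step}: lr = {scheduler.get_last_lr()[0]:.2e}")
```

With a Pre-LN model, the same optimizer can often be run with this ramp shortened or dropped entirely, which is the "warm-up can sometimes be removed" point from the table.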
Despite post-norm being used in the original transformer, there has been a shift to pre-norm. This is because training stability trumps marginal performance gains, especially as models have become larger and deeper.