In machine learning, covariate shift refers to a change in the input distribution between two datasets. For instance, when the training set and the test set are drawn from different distributions, it is preferable to normalize the data before training and evaluating our machine learning model.
Consider training a neural network with multiple hidden layers, for instance:
$\mathbf{X}^l=\mathbf{A}^{l-1}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^l)$
$\mathbf{X}^{l+1}=\mathbf{A}^l\mathbf{W}^{l+1}+\mathbf{b}^{l+1}$
$\mathbf{A}^{l+1} = f^{l+1}(\mathbf{X}^{l+1})$
Suppose $\mathbf{X}^{l-1}$ has already been normalized such that $\text{E}[\mathbf{X}^{l-1}]=0, \text{Var}[\mathbf{X}^{l-1}]=1$. This does not guarantee that the pre-activation $\mathbf{X}^l$ of $Layer^l$ is also normalized. In other words, there will be a difference between the input distributions of $Layer^l$ and $Layer^{l+1}$. This is referred to as internal covariate shift in deep neural networks. For instance, when using the sigmoid activation function, without proper normalization of $\mathbf{X}^l$, many units receive large-magnitude inputs and saturate: their outputs sit near $0$ or $1$, where the gradient is nearly $0$. These saturated units slow down training. Therefore, it is important to ensure that the input of each layer follows a similar distribution.
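The saturation effect can be illustrated with a small sketch (a minimal NumPy example with illustrative values, not from the original text): unnormalized pre-activations with a large scale push most sigmoid units into the flat regions of the curve, where the gradient is effectively zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Unnormalized pre-activations with a large scale (illustrative choice).
x_raw = rng.normal(loc=0.0, scale=10.0, size=10_000)
# The same values after standardization to zero mean and unit variance.
x_norm = (x_raw - x_raw.mean()) / x_raw.std()

# Fraction of units whose gradient is effectively zero (< 1e-3).
sat_raw = np.mean(sigmoid_grad(x_raw) < 1e-3)
sat_norm = np.mean(sigmoid_grad(x_norm) < 1e-3)
```

Here a large share of the unnormalized inputs fall in the saturated regions, while almost none of the standardized inputs do, which is why keeping each layer's input distribution controlled matters for gradient flow.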
To address this, the authors suggest normalizing the activations of the previous layer before applying the linear transformation and activation function:
$\mathbb{E}[\mathbf{A}_j^{l-1}]=\frac{1}{m}\sum\limits_{i=1}^m A_{ij}^{l-1}$
$\text{Var}[\mathbf{A}_j^{l-1}]=\frac{1}{m}\sum\limits_{i=1}^m(A_{ij}^{l-1} - \mathbb{E}[\mathbf{A}_j^{l-1}])^2$
$\mathbf{A}^{l-1}_{norm}=\frac{\mathbf{A}^{l-1}-\mathbb{E}[\mathbf{A}^{l-1}]}{\sqrt{\text{Var}[\mathbf{A}^{l-1}]}}$
$\mathbf{X}^l=\mathbf{A}^{l-1}_{norm}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^{l})$
This can be thought of as normalizing along the batch dimension (computing the mean and variance of each feature over the batch of samples). For instance, if the output of the activation is $\mathbf{A}\in\mathbb{R}^{b\times n}$, batch normalization computes the mean and variance of each feature, resulting in $\mu\in\mathbb{R}^{1\times n}$, $\sigma^2\in\mathbb{R}^{1\times n}$. Each feature is normalized independently (as opposed to layer normalization, where the mean and variance over all features are computed for each sample in the mini-batch). Note that the normalization takes place within each mini-batch.
[Figure: performance on the ImageNet Large Scale Visual Recognition Challenge 2012]
At inference time, we compute the means and variances from the mini-batch statistics accumulated during training:
$\mathbb{E}[\mathbf{A}^{l-1}]=\mathbb{E}[\mathbb{E}_{batch}[\mathbf{A}^{l-1}]]$
$\text{Var}[\mathbf{A}^{l-1}]=\frac{m}{m-1}\mathbb{E}[\text{Var}_{batch}[\mathbf{A}^{l-1}]]$
$\mathbf{A}^{l-1}_{norm}=\frac{\mathbf{A}^{l-1}-\mathbb{E}[\mathbf{A}^{l-1}]}{\sqrt{\text{Var}[\mathbf{A}^{l-1}]}}$
$\mathbf{X}^l=\mathbf{A}^{l-1}_{norm}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^{l})$
where $m$ is the batch size used in the training process.
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation='relu'),
])
```
The authors suggest that using batch normalization can effectively improve the training process of the neural network; however, they also note that: