In machine learning, covariate shift refers to a change in the input distribution between two datasets. For instance, when the training set and the test set are drawn from different distributions, it is preferable to normalize the data before training and evaluating our machine learning model.
Consider training a neural network with multiple hidden layers, for instance:
$\mathbf{X}^l=\mathbf{A}^{l-1}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^l)$
$\mathbf{X}^{l+1}=\mathbf{A}^l\mathbf{W}^{l+1}+\mathbf{b}^{l+1}$
$\mathbf{A}^{l+1} = f^{l+1}(\mathbf{X}^{l+1})$
Suppose $\mathbf{X}^{l-1}$ has already been normalized such that $\text{E}[\mathbf{X}^{l-1}]=0, \text{Var}[\mathbf{X}^{l-1}]=1$. This does not guarantee that the pre-activation $\mathbf{X}^l$ of $Layer^l$ is also normalized. In other words, there will be a difference between the input distributions of $Layer^l$ and $Layer^{l+1}$. This is referred to as internal covariate shift in deep neural networks. For instance, when using the sigmoid activation function, without proper normalization of $\mathbf{X}^l$, many units receive large-magnitude inputs and saturate: their outputs sit near $0$ or $1$, where the gradient is nearly $0$. These saturated units slow down training. Therefore, it is important to ensure that the input of each layer follows a similar distribution.
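The saturation effect can be illustrated with a small sketch (a minimal NumPy example with illustrative values, not from the original text): unnormalized pre-activations with a large scale push most sigmoid units into the flat regions of the curve, where the gradient is effectively zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Unnormalized pre-activations with a large scale (illustrative choice).
x_raw = rng.normal(loc=0.0, scale=10.0, size=10_000)
# The same values after standardization to zero mean and unit variance.
x_norm = (x_raw - x_raw.mean()) / x_raw.std()

# Fraction of units whose gradient is effectively zero (< 1e-3).
sat_raw = np.mean(sigmoid_grad(x_raw) < 1e-3)
sat_norm = np.mean(sigmoid_grad(x_norm) < 1e-3)
```

Here a large share of the unnormalized inputs fall in the saturated regions, while almost none of the standardized inputs do, which is why keeping each layer's input distribution controlled matters for gradient flow.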
To address this, the authors suggest normalizing the activations of the previous layer before applying the linear transformation and activation function:
$\mathbb{E}[\mathbf{A}_j^{l-1}]=\frac{1}{m}\sum\limits_{i=1}^m A_{ij}^{l-1}$
$\text{Var}[\mathbf{A}_j^{l-1}]=\frac{1}{m}\sum\limits_{i=1}^m(A_{ij}^{l-1} - \mathbb{E}[\mathbf{A}_j^{l-1}])^2$
$\mathbf{A}^{l-1}_{norm}=\frac{\mathbf{A}^{l-1}-\mathbb{E}[\mathbf{A}^{l-1}]}{\sqrt{\text{Var}[\mathbf{A}^{l-1}]}}$
$\mathbf{X}^l=\mathbf{A}^{l-1}_{norm}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^{l})$
This can be thought of as normalizing along the batch dimension (computing the mean and variance of each feature over the batch of samples). For instance, if the output of the activation is $\mathbf{A}\in\mathbb{R}^{b\times n}$, batch normalization computes the mean and variance of each feature, resulting in $\mu\in\mathbb{R}^{1\times n}$, $\sigma^2\in\mathbb{R}^{1\times n}$. Each feature is normalized independently (as opposed to layer normalization, where the mean and variance over all features are computed for each sample in the mini-batch). Note that the normalization takes place within each mini-batch.
[Figure: performance on the ImageNet Large Scale Visual Recognition Challenge 2012]
At inference time, we compute the means and variances from the mini-batch statistics accumulated during training:
$\mathbb{E}[\mathbf{A}^{l-1}]=\mathbb{E}[\mathbb{E}_{batch}[\mathbf{A}^{l-1}]]$
$\text{Var}[\mathbf{A}^{l-1}]=\frac{m}{m-1}\mathbb{E}[\text{Var}_{batch}[\mathbf{A}^{l-1}]]$
$\mathbf{A}^{l-1}_{norm}=\frac{\mathbf{A}^{l-1}-\mathbb{E}[\mathbf{A}^{l-1}]}{\sqrt{\text{Var}[\mathbf{A}^{l-1}]}}$
$\mathbf{X}^l=\mathbf{A}^{l-1}_{norm}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^{l})$
where $m$ is the batch size used in the training process.
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation='relu'),
])
```
The authors suggest that using batch normalization can effectively improve the training process of the neural network; however, they also note that: