Each neuron in a neural network consists of a linear transformation of the input followed by an activation function. Since the activation function has no trainable weights, if the initial weights of all neurons in a layer are the same, then the output and the gradient computed for every neuron in that layer are also the same. In other words, the layer learns only one representation of the input despite having multiple neurons.
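
As a minimal sketch of this symmetry problem (a toy example, not from the original text; the layer sizes and the constant value are arbitrary), the snippet below initializes every weight in a hidden layer to the same constant and shows that every neuron receives an identical gradient:

import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal((8, 4))     # a small random batch of inputs
y = tf.random.normal((8, 1))     # arbitrary regression targets

# Every weight in the hidden layer starts at the same constant value.
hidden = tf.keras.layers.Dense(
    3, activation="tanh",
    kernel_initializer=tf.keras.initializers.Constant(0.5),
    bias_initializer="zeros")
output = tf.keras.layers.Dense(
    1, kernel_initializer=tf.keras.initializers.Constant(0.5))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((output(hidden(x)) - y) ** 2)

grads = tape.gradient(loss, hidden.trainable_variables)
# Each column of the kernel gradient corresponds to one neuron; all columns
# are identical, so every neuron gets exactly the same update.
print(grads[0].numpy())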

Consider two layers in the neural network:

$\mathbf{X}^l=\mathbf{A}^{l-1}\mathbf{W}^l+\mathbf{b}^l$

$\mathbf{A}^l = f^l(\mathbf{X}^l)$

$\mathbf{X}^{l+1}=\mathbf{A}^l\mathbf{W}^{l+1}+\mathbf{b}^{l+1}$

$\mathbf{A}^{l+1} = f^{l+1}(\mathbf{X}^{l+1})$
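
As a quick illustration (a NumPy sketch with assumed layer widths, not part of the original text), these two layers can be written directly as matrix operations:

import numpy as np

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(32, 8))                 # A^{l-1}: batch of 32, width 8

w_l, b_l = rng.normal(size=(8, 16)), np.zeros(16)
x_l = a_prev @ w_l + b_l                          # X^l = A^{l-1} W^l + b^l
a_l = np.tanh(x_l)                                # A^l = f^l(X^l)

w_next, b_next = rng.normal(size=(16, 4)), np.zeros(4)
x_next = a_l @ w_next + b_next                    # X^{l+1} = A^l W^{l+1} + b^{l+1}
a_next = np.tanh(x_next)                          # A^{l+1} = f^{l+1}(X^{l+1})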

The guiding principle of weight initialization is to initialize the weights so that the pre-activations have zero mean and their variance stays the same from layer to layer (see the initialization methods below for justification):

$\text{E}[X^{l}]=\text{E}[X^{l-1}]=0$

$\text{Var}[X^l]=\text{Var}[X^{l-1}]$

Here, let's assume that the input $\mathbf{X}^0$ is normalized to have zero mean and unit variance, that the bias term is independent of $AW$, and that the weights $W$ are independent of $A$:

$\text{E}[X]=0$

$\text{Var}[X]=1$

$\text{Cov}(AW, b)=0$

$\text{Cov}(A, W)=0$
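
The first two assumptions can be satisfied by standardizing the input features. A short sketch (illustrative only, with made-up data):

import numpy as np

rng = np.random.default_rng(0)
x0_raw = rng.normal(loc=5.0, scale=3.0, size=(1_000, 8))   # arbitrary raw features
x0 = (x0_raw - x0_raw.mean(axis=0)) / x0_raw.std(axis=0)   # E[X^0]=0, Var[X^0]=1
print(x0.mean(), x0.var())                                 # approximately 0 and 1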

Glorot Initialization (Xavier Initialization)

Description

Consider applying a sigmoid or hyperbolic tangent activation function in the neurons. Around zero these activations operate in their linear regime, so we can approximate the activation by the identity and write:

$\mathbf{A}^{l}=\mathbf{X}^{l}=\mathbf{A}^{l-1}\mathbf{W}^l+\mathbf{b}^l$

$\frac{\partial a_{j}^l}{\partial w^l_{ij}}=a_{i}^{l-1}$

We can see that the gradient depends on the output of the previous layer. Since the input is normalized to have zero mean, if we initialize $\mathbf{b}$ to be zero, the expected value of $A$ should be:

$\mathbf{b}^l=0$

$\text{E}[A^l]=0$

Let's initialize $W$ to have zero mean:

$\text{E}[W]=0$

Since, for independent random variables $X$ and $Y$:

$\text{Var}[XY]=\text{E}[X]^2\text{Var}[Y]+\text{E}[Y]^2\text{Var}[X]+\text{Var}[X]\text{Var}[Y]$
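
A quick Monte Carlo sanity check of this identity (illustrative, not from the original text):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, scale=1.5, size=1_000_000)   # arbitrary mean and variance
y = rng.normal(loc=-0.7, scale=0.8, size=1_000_000)

empirical = np.var(x * y)
predicted = x.mean()**2 * y.var() + y.mean()**2 * x.var() + x.var() * y.var()
print(empirical, predicted)                          # agree up to sampling noise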

Rewriting the variance of the output of neuron $j$ in layer $l$, and assuming the weights in each neuron are independent and identically distributed:

$\text{Var}[a^l_j]=\text{Var}\left[\sum\limits_{i=1}^{n^{l-1}}a_i^{l-1}w_{ij}^l+b^l_j\right]=\sum\limits_{i=1}^{n^{l-1}}\text{Var}[a_i^{l-1}w_{ij}^l+b^l_j]$

$\text{Var}[a^l_j]=\sum\limits_{i=1}^{n^{l-1}}(\text{E}[a_i^{l-1}]^2\text{Var}[w_{ij}^l]+\text{E}[w_{ij}^l]^2\text{Var}[a_{i}^{l-1}]+\text{Var}[a_{i}^{l-1}]\text{Var}[w_{ij}^l]+\text{Var}[b^l_j]+2\text{Cov}(a_{i}^{l-1}w_{ij}^l, b_j^l))$

$\text{Var}[a^l_j]=\sum\limits_{i=1}^{n^{l-1}}(\text{Var}[a_i^{l-1}]\text{Var}[w_{ij}^l])$

Finally, if we initialize the variance of each weight to be the same, and assume the variance of the output of each neuron in layer $l-1$ is the same, then:

$\text{Var}[A^{l}]=n_{l-1}\text{Var}[A^{l-1}]\text{Var}[W^{l}]$

This suggests that the variance of the activations in layer $l$ can be written as a product over the preceding layers:

$\text{Var}[A^l]=\prod\limits_{z=1}^{l-1}n_z\text{Var}[W^z]\text{Var}[X^{0}]=\prod\limits_{z=1}^{l-1}n_z\text{Var}[W^z]$
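
To see this product formula in action, here is a small simulation (a sketch with assumed width, depth, and $\text{Var}[W]$; it uses the identity activation, matching the linear-regime approximation above):

import numpy as np

rng = np.random.default_rng(0)
n, depth, var_w = 256, 10, 1.0 / 256           # width, number of layers, Var[W]
a = rng.normal(size=(2_000, n))                # zero-mean, unit-variance input

predicted = 1.0
for _ in range(depth):
    w = rng.normal(scale=np.sqrt(var_w), size=(n, n))
    a = a @ w                                  # identity activation (linear regime)
    predicted *= n * var_w

print(a.var(), predicted)                      # both stay close to 1 when Var[W] = 1/n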

Recall that the gradient of the weight in each layer depends on the output of the previous layer:

$\frac{\partial A_i^l}{\partial \mathbf{w}^l_i}=\frac{\partial X_i^l}{\partial \mathbf{w}^l_i}=\mathbf{X}^{l-1}=\mathbf{A}^{l-1}$

Therefore, we would like to initialize our weights such that the variance of the activations stays the same across the different layers. To address this, Glorot and Bengio suggest initializing the network with the following rules (their paper also balances the backward pass, which leads to the commonly used compromise $\text{Var}[W^l]=\frac{2}{n^{l-1}+n^{l}}$):

$\mathbf{W}^l\sim\text{N}(0, \frac{1}{n^{l-1}})$

$\mathbf{b}^{l}=0$

Implementation

API (the Glorot uniform variant is the default kernel initializer in TensorFlow's Keras layers)

import tensorflow as tf

# Glorot (Xavier) normal initialization for the kernel weights of a Dense layer.
initializer = tf.keras.initializers.GlorotNormal()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)
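
As a small check (illustrative; note that TensorFlow's GlorotNormal uses the fan-averaged variance $\frac{2}{n^{l-1}+n^{l}}$ from the Glorot and Bengio paper, drawn from a truncated normal):

import tensorflow as tf

init = tf.keras.initializers.GlorotNormal(seed=0)
w = init(shape=(400, 300))                     # fan_in = 400, fan_out = 300
print(float(tf.math.reduce_std(w)))            # close to sqrt(2 / (400 + 300)) ≈ 0.053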

He Initialization

Description

Consider a ReLU activation function applied in each layer:

$a^{l}_{ij}=\begin{cases}x^l_{ij}\;&\text{if } x_{ij}^l > 0\\0\;&\text{otherwise}\end{cases}$

$\frac{\partial(a^l_{ij})}{\partial x^l_{ij}}=\begin{cases}1\;&\text{if } x^l_{ij} > 0\\0\;&\text{otherwise}\end{cases}$

Unlike the assumption made in Glorot initialization, the output of the ReLU activation does not have zero mean, so we work with its second moment instead:

$\text{E}[(A^{l-1})^2]\ne0$

However, if we initialize the weights to have zero mean (so that the pre-activation $X^l$ is symmetric around zero), and $\mathbf{b}$ to be zero:

$\mathbf{b}^l=0$

$\text{E}[(A^{l})^2]=\frac{1}{2}\text{E}[(X^{l})^2]=\frac{1}{2}\text{Var}[X^{l}]$
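
A quick numerical check of the factor $\frac{1}{2}$ (illustrative, not from the original text): for a symmetric, zero-mean pre-activation, rectifying discards half of the second moment.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.7, size=1_000_000)   # symmetric around zero
a = np.maximum(x, 0.0)                               # ReLU
print(np.mean(a**2), 0.5 * np.var(x))                # the two values agree closely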

Similar to what we derived for Glorot initialization:

$\text{Var}[x^l_j]=\sum\limits_{i=1}^{n^{l-1}}(\text{E}[a_i^{l-1}]^2\text{Var}[w_{ij}^l]+\text{Var}[a_{i}^{l-1}]\text{Var}[w_{ij}^l])$

$\text{Var}[x^l_j]=\sum\limits_{i=1}^{n^{l-1}}\text{E}[(a_i^{l-1})^2]\text{Var}[w_{ij}^l]$

Similarly, if we initialize the variance of each weight to be the same, and assume the variance after the linear combination is the same:

$\text{Var}[X^{l}]=n_{l-1}\text{E}[(A^{l-1})^2]\text{Var}[W^{l}]$

$\text{Var}[X^{l}]=\frac{1}{2}n_{l-1}\text{Var}[X^{l-1}]\text{Var}[W^{l}]$

This suggests that the variance of the output after the linear combination is:

$\text{Var}[X^l]=\prod\limits_{z=1}^{l-1}\frac{1}{2}n_z\text{Var}[W^z]\text{Var}[X^{0}]=\prod\limits_{z=1}^{l-1}\frac{1}{2}n_z\text{Var}[W^z]$
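
A simulation makes the contrast concrete (a sketch with assumed width and depth): with ReLU activations, $\text{Var}[W]=\frac{2}{n}$ keeps the pre-activation variance roughly constant, while $\text{Var}[W]=\frac{1}{n}$ halves it at every layer.

import numpy as np

def propagate(var_w, n=256, depth=10, samples=2_000, seed=0):
    # Propagate a standardized input through `depth` random ReLU layers and
    # record the pre-activation variance at each layer.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(samples, n))          # standardized input
    variances = []
    for _ in range(depth):
        w = rng.normal(scale=np.sqrt(var_w), size=(n, n))
        x = a @ w
        variances.append(float(x.var()))
        a = np.maximum(x, 0.0)                 # ReLU
    return variances

print(propagate(2.0 / 256))                    # stays roughly constant
print(propagate(1.0 / 256))                    # shrinks by about half per layer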

Recall that the gradient of the weight in each layer depends on the output of the previous layer:

$\frac{\partial X_i^l}{\partial \mathbf{w}^l_i}=\mathbf{X}^{l-1}$

Therefore, He et al. suggest using the following rules to initialize the weights when using the ReLU activation function, so that the variance of the activations, and hence the gradients, stays stable across the different layers:

$\mathbf{W}^l\sim\text{N}(0, \frac{2}{n^{l-1}})$

$\mathbf{b}^{l}=0$

Implementation

API

import tensorflow as tf

# He normal initialization for the kernel weights of a Dense layer with ReLU.
initializer = tf.keras.initializers.HeNormal()
layer = tf.keras.layers.Dense(3, kernel_initializer=initializer)
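
As with the Glorot check above, a quick way to confirm the variance used (illustrative): HeNormal draws from a truncated normal with standard deviation $\sqrt{\frac{2}{n^{l-1}}}$, i.e. the square root of $2/\text{fan-in}$.

import tensorflow as tf

init = tf.keras.initializers.HeNormal(seed=0)
w = init(shape=(400, 300))                     # fan_in = 400
print(float(tf.math.reduce_std(w)))            # close to sqrt(2 / 400) ≈ 0.071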

More Resources

  1. Initializing neural networks (deeplearning.ai)