Read this:
https://towardsdatascience.com/weight-initialization-techniques-in-neural-networks-26c649eb3b78#:~:text=Some also use the following technique for initialization %3A&text=They set the weights neither,vanish or explode too quickly.
Also,
If all weights in a hidden layer are initialized to 0, the neurons in that layer essentially become identical. This symmetry means the derivative of the loss function with respect to each weight in the layer is the same.
Consequences:
Initializing every weight in a hidden layer to 0 makes the neurons symmetrical in several ways:
1. Identical Inputs: Since all neurons in a layer receive the same input (the previous layer's output), and all their weights start at the same value (0), their initial calculations, including the weighted sum of inputs, will be identical.
2. Identical Gradients: During training, the model updates weights based on the gradients of the loss function, which measure how much changing a specific weight will affect the error. Because every neuron in the layer computes the same output, the gradient of the loss with respect to each corresponding weight is also identical.
3. Identical Updates: Consequently, during the weight update step, all weights will receive the same update value based on the calculated gradient. This means they will all change by the same amount and remain identical to each other.
4. Identical Outputs: With all weights being the same, the calculations performed by each neuron in the layer will also be identical. This leads to all neurons producing the same output for any given input, essentially making them redundant copies of each other.
5. Limited Learning Capacity: A crucial benefit of hidden layers is their ability to learn complex, non-linear relationships between inputs and outputs. With this symmetry, however, every neuron in the layer computes the same function, so the layer has no more representational capacity than a single neuron, defeating the purpose of having many hidden units.
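The points above can be checked numerically. Below is a minimal sketch (the network shape, toy data, and the constant 0.1 used in the second experiment are illustrative assumptions): with all-zero weights the gradients are exactly zero, and with any shared constant the hidden units receive identical gradients and stay identical after the update.

```python
import numpy as np

# Toy data: 8 samples, 3 features, scalar regression targets (assumed setup).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

def one_step(W1, W2, lr=0.1):
    """One forward/backward pass of a tanh MLP with a 0.5*MSE loss."""
    h = np.tanh(x @ W1)              # hidden activations
    err = (h @ W2) - y               # dLoss/dprediction
    dW2 = h.T @ err
    dW1 = x.T @ (err @ W2.T * (1 - h ** 2))
    return W1 - lr * dW1, W2 - lr * dW2, dW1

# Case 1: everything zero -> every gradient is exactly zero, nothing learns.
_, _, dW1_zero = one_step(np.zeros((3, 5)), np.zeros((5, 1)))
print(np.all(dW1_zero == 0))         # gradients vanish entirely

# Case 2: all weights set to the same constant -> gradients are nonzero,
# but every hidden unit (each column of W1) receives the *same* gradient,
# so the units remain identical after the update, and forever after.
W1, W2, dW1 = one_step(np.full((3, 5), 0.1), np.full((5, 1), 0.1))
print(np.allclose(dW1, dW1[:, [0]]))  # all columns of the gradient match
print(np.allclose(W1, W1[:, [0]]))    # units still identical after the step
```

Note that case 2 shows the symmetry problem is not specific to zero: any initialization that gives every weight the same value traps the hidden units in lockstep.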
Therefore, although the weights themselves aren't technically "mirrored" or flipped around a central axis, they behave symmetrically in terms of their input, output, and learning capabilities, ultimately rendering the hidden layer redundant.
This phenomenon highlights the importance of proper weight initialization techniques like Xavier or He initialization, which break the symmetry by giving the weights diverse random starting points, allowing neurons to learn different functions and contribute meaningfully to the network's learning process.
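For reference, here is a minimal sketch of the two schemes mentioned, in their normal-distribution variants (uniform variants also exist). The layer sizes are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # Glorot & Bengio (2010): variance 2 / (fan_in + fan_out),
    # suited to tanh/sigmoid activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He et al. (2015): variance 2 / fan_in, suited to ReLU activations.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(256, 128)
print(W.std())   # roughly sqrt(2/256), keeping activation variance stable
```

Both scale the random weights so the variance of activations (and gradients) stays roughly constant from layer to layer, which is what prevents them from vanishing or exploding.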