During backpropagation, gradients flow backward through the network, getting multiplied by each layer's weights along the way. Consider a simple 3-layer network:
Gradient flow: by the chain rule, ∂Loss/∂w₁ involves the product w₃ × w₂ × ... — each additional layer contributes another weight factor to the product.
With large weights (say w = 10 in each layer): the product is 10 × 10 = 100, and with each added layer the gradient grows by another factor of 10 — exploding gradients.
With small weights (say w = 0.5 in each layer): the product is 0.5 × 0.5 = 0.25, and with each added layer the gradient shrinks by another factor of 0.5 — vanishing gradients.
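The scaling above can be sketched numerically. A minimal sketch, assuming a linear chain with no activation functions, so the gradient magnitude is just the product of the downstream weights:

```python
# Gradient scale through a chain of layers with equal weight `w`:
# the gradient w.r.t. the first layer picks up one factor of `w`
# per downstream layer, so it scales as w**depth.

def gradient_scale(w, depth):
    """Product of `depth` downstream weight factors of value `w`."""
    scale = 1.0
    for _ in range(depth):
        scale *= w
    return scale

print(gradient_scale(10.0, 2))   # → 100.0 (large weights, 2 layers)
print(gradient_scale(0.5, 2))    # → 0.25 (small weights, 2 layers)
print(gradient_scale(10.0, 10))  # → 1e10 (exploding with depth)
print(gradient_scale(0.5, 10))   # → ~0.00098 (vanishing with depth)
```

The exponential dependence on depth is the key point: at 2 layers the effect is mild, but at 10 layers the same weights produce gradients ten orders of magnitude apart.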
When gradients explode (become very large), your parameter updates during training become erratic: a single oversized step can overshoot the minimum, make the loss oscillate, or overflow to NaN.
With stable gradients from smaller weights, updates are proportional and predictable, leading to smooth convergence.
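A hypothetical illustration of the contrast: gradient descent on f(x) = x², with the gradient artificially scaled to mimic stable versus exploded backpropagated gradients (the scale factors and learning rate are chosen for demonstration, not taken from the text):

```python
# Gradient descent on f(x) = x**2 (true gradient: 2*x), with the
# gradient multiplied by `grad_scale` to mimic what backprop
# delivers after passing through well- or badly-scaled layers.

def descend(x, grad_scale, lr=0.1, steps=5):
    history = [x]
    for _ in range(steps):
        grad = 2 * x * grad_scale   # (scaled) gradient of x**2
        x = x - lr * grad           # standard update rule
        history.append(x)
    return history

print(descend(1.0, grad_scale=1.0))
# → [1.0, 0.8, 0.64, 0.512, 0.4096, 0.32768] — smooth convergence
print(descend(1.0, grad_scale=100.0))
# → [1.0, -19.0, 361.0, -6859.0, 130321.0, -2476099.0] — oscillates and blows up
```

With the stable gradient, each step shrinks x by a constant factor; with the exploded gradient, each step overshoots zero by a growing margin, exactly the erratic behavior described above.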
Consider a simple neuron: output = w₁x₁ + w₂x₂ + ... + bias
With large weights (w₁ = 100, w₂ = 150): a tiny change in either input shifts the output by 100–150 times that amount, making the neuron's output extremely sensitive to its inputs.
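A quick sketch of that sensitivity, using the weights from the example (the bias is set to 0.0 here for simplicity; the text leaves it unspecified):

```python
# The simple neuron from above: output = w1*x1 + w2*x2 + bias.

def neuron(x1, x2, w1=100.0, w2=150.0, bias=0.0):
    return w1 * x1 + w2 * x2 + bias

base   = neuron(1.0, 1.0)    # → 250.0
nudged = neuron(1.01, 1.0)   # x1 perturbed by just 0.01
print(nudged - base)         # → 1.0: the 0.01 nudge is amplified 100×
```

The output change is always the input change times the corresponding weight, so weights of 100–150 amplify every small input perturbation by two orders of magnitude.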