The common goal of all of these techniques is to keep the activations in a good (well-spread, non-saturated) distribution through the layers.
Activation Functions
**[brain]** activation: the neuron's stimulus/firing mechanism
Matrix multiplication is a linear operation, but most of the time we need to solve nonlinear problems, so we insert a nonlinear activation function between layers. Common choices are listed below (a small NumPy sketch of them follows the list).
- sigmoid $\sigma (x)=\frac{1}{1+e^{-x}}$ in (0,1), $\sigma' (x)=\sigma (x)(1-\sigma (x))$
- [brain] historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron
- three problems with this function:
- sigmoid outputs are not zero-centered ⇒ the inputs to the next layer are always positive ⇒ the local gradient w.r.t. every weight of a neuron is positive, so all weight gradients share the sign of the upstream scalar gradient ⇒ updates are restricted to all-positive or all-negative directions, forcing a zig-zag path to the optimal weights ⇒ this is one reason we want zero-mean data/activations
- saturated neurons “kill” the gradients: when |x| is large, $\sigma'(x) \approx 0$, so almost no gradient flows back
- exp is expensive

- $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ in (-1,1); zero-centered, but still saturates
- rectified linear unit (ReLU)
- $ReLU(x)=\max(0,x)$
- annoyance: gradient = 0 when x < 0, so a neuron stuck in that regime (a “dead” ReLU) stops updating
- Leaky ReLU [Maas et al., 2013]
- $LeakyReLU(x)=\max(0.01x, x)$
- Parametric Rectifier (PReLU) [Kaiming He et al., 2015]: $PReLU(x)=\max(\alpha x, x)$ with a learnable slope $\alpha$
- Exponential Linear Units (ELU) [Clevert et al., 2015]
- $ELU(x)=a(e^x-1)[x \le 0]+x[x>0]$
- $ELU'(x)=ae^x [x \le 0]+1[x>0]$
- Scaled Exponential Linear Units (SELU) [Klambauer et al. ICLR 2017]
- $SELU(x)=[x \le 0]\lambda\alpha(e^x-1)+[x>0]\lambda x$
- $SELU'(x)=[x \le 0]\lambda\alpha e^x+[x>0]\lambda$
- Maxout Neuron
- $\max(w_1^Tx+b_1,w_2^Tx+b_2)$
- nonlinearity at the cost of doubling the number of parameters per unit
- piecewise-linear regime: doesn’t saturate, doesn’t die
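A minimal NumPy sketch of the functions listed above (the function names are my own; the SELU constants λ ≈ 1.0507, α ≈ 1.6733 are the values reported by Klambauer et al.):

```python
import numpy as np

def sigmoid(x):
    # squashes to (0, 1); saturates (gradient ~ 0) for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # zero-centered squashing to (-1, 1); still saturates
    return np.tanh(x)

def relu(x):
    # max(0, x): no saturation for x > 0, but gradient is 0 for x < 0
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # small slope for x < 0 so the unit never fully "dies"
    return np.where(x > 0, x, negative_slope * x)

def elu(x, a=1.0):
    # smooth negative saturation toward -a instead of a hard zero
    return np.where(x > 0, x, a * np.expm1(np.minimum(x, 0.0)))

def selu(x, lam=1.0507, alpha=1.6733):
    # scaled ELU; these constants make activations self-normalizing
    return lam * np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two affine maps; twice the parameters per unit
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```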
Weight Initialization
- first idea: use small random weights (std = 0.01)
- in deeper layers the activations shrink toward zero, so the gradients all go to zero
- second idea: use larger random weights (std = 0.05)
- in deeper layers the activations saturate, so the gradients again go to zero
- [figure: per-layer activation histograms at std = 0.01 and std = 0.05]
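A small sketch of this experiment, assuming a deep tanh MLP; the depth (6), width (4096), and batch size are illustrative choices, not specified in the notes:

```python
import numpy as np

def layer_activation_stds(weight_std, num_layers=6, dim=4096, batch=100, seed=0):
    """Push a random batch through a deep tanh MLP initialized with
    N(0, weight_std^2) weights and return the activation std at each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    stds = []
    for _ in range(num_layers):
        W = weight_std * rng.standard_normal((dim, dim))
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

# std = 0.01: activations shrink layer by layer, so the gradients vanish
print(layer_activation_stds(0.01))
# std = 0.05: pre-activations grow, tanh saturates near ±1, gradients vanish again
print(layer_activation_stds(0.05))
```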
- Xavier initializer / “Glorot Normal” or “Glorot Uniform” weight initialization
- Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
- assumes a zero-centered, approximately linear activation
- $\text{std}=1/\sqrt{D_{in}}$ ($D_{in}$ = input dimension)
- for conv layers, $D_{in} = \text{filter\_size}^2 \times \text{input\_channels}$
- explanation: $y=\sum_{i=1}^{D_{in}} x_i w_i$ with i.i.d. zero-mean $x_i, w_i$, so $Var(y)=D_{in}\,Var(x_i)\,Var(w_i)$; requiring $Var(y)=Var(x_i)$ gives $Var(w_i)=1/D_{in}$, i.e. $\text{std}=1/\sqrt{D_{in}}$
- Kaiming / MSRA initializer / “He Normal” or “He Uniform” weight initialization
- $\text{std}=1/\sqrt{D_{in}/2}=\sqrt{2/D_{in}}$ (the extra factor of 2 corrects for ReLU zeroing half the activations)
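A sketch comparing the two rules in the same deep-MLP setting as above (layer width and depth are again illustrative; Xavier is paired with tanh and Kaiming with ReLU, matching the settings the respective papers analyze):

```python
import numpy as np

def forward_stats(std_fn, nonlin, num_layers=6, dim=4096, batch=100, seed=0):
    """Per-layer activation std when weights are drawn as std_fn(Din) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    stds = []
    for _ in range(num_layers):
        W = std_fn(dim) * rng.standard_normal((dim, dim))
        x = nonlin(x @ W)
        stds.append(float(x.std()))
    return stds

xavier_std = lambda din: 1.0 / np.sqrt(din)    # std = 1 / sqrt(Din)
kaiming_std = lambda din: np.sqrt(2.0 / din)   # std = sqrt(2 / Din)

# Xavier + tanh: activation scale stays roughly constant across layers
print(forward_stats(xavier_std, np.tanh))
# Kaiming + ReLU: the factor of 2 compensates for ReLU zeroing half the units
print(forward_stats(kaiming_std, lambda z: np.maximum(z, 0.0)))
```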
Proper initialization is an active area of research…
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al., 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019