The common goal of all of these techniques is to keep the activations in a good (well-spread, non-saturated) distribution through the layers.
Activation Functions
**[brain]** activation: the neuron's stimulus/firing mechanism
Matrix multiplication is a linear operation, but most of the time we need to solve nonlinear problems, so we insert a nonlinear activation function between layers. Common choices are listed below (a small NumPy sketch of them follows the list).
- sigmoid $\sigma (x)=\frac{1}{1+e^{-x}}$ in (0,1), $\sigma' (x)=\sigma (x)(1-\sigma (x))$
- [brain] historically popular since it has a nice interpretation as the saturating “firing rate” of a neuron
- three problems with this function:
- sigmoid outputs are not zero-centered ⇒ the inputs to the next layer are always positive ⇒ the local gradient w.r.t. every weight of a neuron is positive, so all weight gradients share the sign of the upstream scalar gradient ⇒ updates are restricted to all-positive or all-negative directions, forcing a zig-zag path to the optimal weights ⇒ this is one reason we want zero-mean data/activations
- saturated neurons “kill” the gradients: when |x| is large, $\sigma'(x) \approx 0$, so almost no gradient flows back
- exp is expensive

- $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ in (-1,1); zero-centered, but still saturates
- rectified linear unit (ReLU)
- $ReLU(x)=\max(0,x)$
- annoyance: gradient = 0 when x < 0, so a neuron stuck in that regime (a “dead” ReLU) stops updating
- Leaky ReLU [Maas et al., 2013]
- $LeakyReLU(x)=\max(0.01x, x)$
- Parametric Rectifier (PReLU) [Kaiming He et al., 2015]: $PReLU(x)=\max(\alpha x, x)$ with a learnable slope $\alpha$
- Exponential Linear Units (ELU) [Clevert et al., 2015]
- $ELU(x)=a(e^x-1)[x \le 0]+x[x>0]$
- $ELU'(x)=ae^x [x \le 0]+1[x>0]$
- Scaled Exponential Linear Units (SELU) [Klambauer et al. ICLR 2017]
- $SELU(x)=[x \le 0]\lambda\alpha(e^x-1)+[x>0]\lambda x$
- $SELU'(x)=[x \le 0]\lambda\alpha e^x+[x>0]\lambda$
- Maxout Neuron
- $\max(w_1^Tx+b_1,w_2^Tx+b_2)$
- nonlinearity at the cost of doubling the number of parameters per unit
- piecewise-linear regime: doesn’t saturate, doesn’t die
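A minimal NumPy sketch of the functions listed above (the function names are my own; the SELU constants λ ≈ 1.0507, α ≈ 1.6733 are the values reported by Klambauer et al.):

```python
import numpy as np

def sigmoid(x):
    # squashes to (0, 1); saturates (gradient ~ 0) for large |x|
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # zero-centered squashing to (-1, 1); still saturates
    return np.tanh(x)

def relu(x):
    # max(0, x): no saturation for x > 0, but gradient is 0 for x < 0
    return np.maximum(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # small slope for x < 0 so the unit never fully "dies"
    return np.where(x > 0, x, negative_slope * x)

def elu(x, a=1.0):
    # smooth negative saturation toward -a instead of a hard zero
    return np.where(x > 0, x, a * np.expm1(np.minimum(x, 0.0)))

def selu(x, lam=1.0507, alpha=1.6733):
    # scaled ELU; these constants make activations self-normalizing
    return lam * np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def maxout(x, W1, b1, W2, b2):
    # elementwise max of two affine maps; twice the parameters per unit
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```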
Weight Initialization
- first idea: use small random weights (std = 0.01)
- in deeper layers the activations shrink toward zero, so the gradients all go to zero
- second idea: use larger random weights (std = 0.05)
- in deeper layers the activations saturate, so the gradients again go to zero
- [figure: per-layer activation histograms at std = 0.01 and std = 0.05]
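A small sketch of this experiment, assuming a deep tanh MLP; the depth (6), width (4096), and batch size are illustrative choices, not specified in the notes:

```python
import numpy as np

def layer_activation_stds(weight_std, num_layers=6, dim=4096, batch=100, seed=0):
    """Push a random batch through a deep tanh MLP initialized with
    N(0, weight_std^2) weights and return the activation std at each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    stds = []
    for _ in range(num_layers):
        W = weight_std * rng.standard_normal((dim, dim))
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

# std = 0.01: activations shrink layer by layer, so the gradients vanish
print(layer_activation_stds(0.01))
# std = 0.05: pre-activations grow, tanh saturates near ±1, gradients vanish again
print(layer_activation_stds(0.05))
```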
- Xavier initializer / “Glorot Normal” or “Glorot Uniform” weight initialization
- Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010
- assumes a zero-centered, approximately linear activation
- $\text{std}=1/\sqrt{D_{in}}$ ($D_{in}$ = input dimension)
- for conv layers, $D_{in} = \text{filter\_size}^2 \times \text{input\_channels}$
- explanation: $y=\sum_{i=1}^{D_{in}} x_i w_i$ with i.i.d. zero-mean $x_i, w_i$, so $Var(y)=D_{in}\,Var(x_i)\,Var(w_i)$; requiring $Var(y)=Var(x_i)$ gives $Var(w_i)=1/D_{in}$, i.e. $\text{std}=1/\sqrt{D_{in}}$
- Kaiming / MSRA initializer / “He Normal” or “He Uniform” weight initialization
- $\text{std}=1/\sqrt{D_{in}/2}=\sqrt{2/D_{in}}$ (the extra factor of 2 corrects for ReLU zeroing half the activations)
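A sketch comparing the two rules in the same deep-MLP setting as above (layer width and depth are again illustrative; Xavier is paired with tanh and Kaiming with ReLU, matching the settings the respective papers analyze):

```python
import numpy as np

def forward_stats(std_fn, nonlin, num_layers=6, dim=4096, batch=100, seed=0):
    """Per-layer activation std when weights are drawn as std_fn(Din) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, dim))
    stds = []
    for _ in range(num_layers):
        W = std_fn(dim) * rng.standard_normal((dim, dim))
        x = nonlin(x @ W)
        stds.append(float(x.std()))
    return stds

xavier_std = lambda din: 1.0 / np.sqrt(din)    # std = 1 / sqrt(Din)
kaiming_std = lambda din: np.sqrt(2.0 / din)   # std = sqrt(2 / Din)

# Xavier + tanh: activation scale stays roughly constant across layers
print(forward_stats(xavier_std, np.tanh))
# Kaiming + ReLU: the factor of 2 compensates for ReLU zeroing half the units
print(forward_stats(kaiming_std, lambda z: np.maximum(z, 0.0)))
```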
Proper initialization is an active area of research…
- Understanding the difficulty of training deep feedforward neural networks, Glorot and Bengio, 2010
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, Saxe et al., 2013
- Random walk initialization for training very deep feedforward networks, Sussillo and Abbott, 2014
- Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, He et al., 2015
- Data-dependent Initializations of Convolutional Neural Networks, Krähenbühl et al., 2015
- All you need is a good init, Mishkin and Matas, 2015
- Fixup Initialization: Residual Learning Without Normalization, Zhang et al., 2019
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, Frankle and Carbin, 2019