Maximum Likelihood Estimation
$$
\hat{\theta} \in \underset{\theta}{\arg\max}\ \ell(\theta; x),
$$
- find the parameter estimate $\hat{\theta}$ that maximizes the likelihood $\ell(\theta; x)$ of the observed data $x$ under the model
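As a concrete sketch, for i.i.d. Gaussian data the log-likelihood is maximized in closed form by the sample mean and the (biased) sample variance; the distribution parameters below are illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # observed data

# Closed-form MLE for a Gaussian: argmax of l(mu, sigma^2; x)
mu_hat = x.mean()                         # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (biased)

print(mu_hat, sigma2_hat)
```

With 10,000 samples the estimates land close to the true parameters (2.0 and 1.5² = 2.25).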
Batch (Full) Gradient Descent
`w := w - lr * dL/dw`
- compute the gradient `dL/dw` on the entire training set before each update
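The update rule above can be sketched on a least-squares objective; the data, `lr`, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    # gradient of the mean-squared loss over the FULL training set
    grad = X.T @ (X @ w - y) / len(y)
    w = w - lr * grad  # w := w - lr * dL/dw
```

Every step uses all 200 examples, so each gradient is exact but each update costs a full pass over the data.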
Stochastic Gradient Descent
- compute the gradient on a mini-batch (a random sample of the training set), which is a stochastic approximation of the true cost gradient
- mini-batches enable fast, vectorized matrix operations
- updates parallelize well across the examples in a batch (e.g. on GPUs)
- convergence is slower than second-order methods (e.g. Newton's method)
- computationally efficient per update
- can converge faster when the learning rate is adjusted (decayed) during training
- choosing the right batch size trades gradient noise against per-step cost
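A minimal sketch of the mini-batch version on the same least-squares objective; `lr`, `batch_size`, and the step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless, so SGD can reach the exact solution

w = np.zeros(3)
lr, batch_size = 0.05, 16
for step in range(2000):
    # sample a mini-batch: a stochastic approximation of the true gradient
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
```

Each update touches only 16 of the 200 examples, so steps are cheap and noisy; on average they still point downhill.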
Learning rate decay
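One common schedule is inverse-time decay; the formula and the `lr0`/`decay_rate` values below are a standard example, not taken from these notes.

```python
# Inverse-time decay: lr_t = lr0 / (1 + decay_rate * t)
lr0, decay_rate = 0.1, 0.01

def lr_at(t: int) -> float:
    """Learning rate at step t under inverse-time decay."""
    return lr0 / (1.0 + decay_rate * t)
```

Early steps keep a large rate for fast progress; later steps shrink it so the noisy SGD updates settle near a minimum.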
Batch Normalization
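A sketch of the batch-norm forward pass in training mode, following the standard formulation (the `gamma`, `beta`, and `eps` names are the usual conventions, not from these notes).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.default_rng(3).normal(loc=5.0, scale=2.0, size=(64, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1` and `beta=0` the output of each feature has roughly zero mean and unit variance across the batch, which stabilizes the scale of activations during training.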