Maximum Likelihood Estimation
$$
\hat{\theta} \in \underset{\theta}{\arg\max}\ \ell(\theta; x),
$$
- find the parameter estimate $\hat{\theta}$ that maximizes the likelihood $\ell(\theta; x)$ of the observed data $x$ under the model
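As a concrete sketch, for i.i.d. Gaussian data the log-likelihood is maximized in closed form by the sample mean and the (biased) sample variance; the distribution parameters below are illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # observed data

# Closed-form MLE for a Gaussian: argmax of l(mu, sigma^2; x)
mu_hat = x.mean()                         # MLE of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance (biased)

print(mu_hat, sigma2_hat)
```

With 10,000 samples the estimates land close to the true parameters (2.0 and 1.5² = 2.25).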
Batch (Full) Gradient Descent
`w := w - lr * dL/dw`
- compute the gradient `dL/dw` on the entire training set before each update
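The update rule above can be sketched on a least-squares objective; the data, `lr`, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    # gradient of the mean-squared loss over the FULL training set
    grad = X.T @ (X @ w - y) / len(y)
    w = w - lr * grad  # w := w - lr * dL/dw
```

Every step uses all 200 examples, so each gradient is exact but each update costs a full pass over the data.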
Stochastic Gradient Descent
- compute the gradient on a mini-batch (a random sample of the training set), which is a stochastic approximation of the true cost gradient
- mini-batches enable fast, vectorized matrix operations
- updates parallelize well across the examples in a batch (e.g. on GPUs)
- convergence is slower than second-order methods (e.g. Newton's method)
- computationally efficient per update
- can converge faster when the learning rate is adjusted (decayed) during training
- choosing the right batch size trades gradient noise against per-step cost
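A minimal sketch of the mini-batch version on the same least-squares objective; `lr`, `batch_size`, and the step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless, so SGD can reach the exact solution

w = np.zeros(3)
lr, batch_size = 0.05, 16
for step in range(2000):
    # sample a mini-batch: a stochastic approximation of the true gradient
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size
    w -= lr * grad
```

Each update touches only 16 of the 200 examples, so steps are cheap and noisy; on average they still point downhill.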
Learning rate decay
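One common schedule is inverse-time decay; the formula and the `lr0`/`decay_rate` values below are a standard example, not taken from these notes.

```python
# Inverse-time decay: lr_t = lr0 / (1 + decay_rate * t)
lr0, decay_rate = 0.1, 0.01

def lr_at(t: int) -> float:
    """Learning rate at step t under inverse-time decay."""
    return lr0 / (1.0 + decay_rate * t)
```

Early steps keep a large rate for fast progress; later steps shrink it so the noisy SGD updates settle near a minimum.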
Batch Normalization
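A sketch of the batch-norm forward pass in training mode, following the standard formulation (the `gamma`, `beta`, and `eps` names are the usual conventions, not from these notes).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.default_rng(3).normal(loc=5.0, scale=2.0, size=(64, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma=1` and `beta=0` the output of each feature has roughly zero mean and unit variance across the batch, which stabilizes the scale of activations during training.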