Optimization: Gradient Descent
- hyperparameter: $lr$ [most important and most headache-inducing]
- analogy: a blindfolded hiker who repeatedly steps in the steepest downhill direction they can feel underfoot
- $W' = W - lr \cdot \nabla_W L$
- numerical gradient: $[f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}-h\mathbf{e}_i)] / 2h$ for each dimension $i$ (centered difference); approximate and slow but easy to write, iterating over all dimensions one by one
- analytic gradient: fast, exact
- in practice: always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
- in practice, the analytic gradient is computed by backpropagation
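The update rule above can be sketched on a toy loss (the quadratic $L(W) = \|W\|^2$ and its analytic gradient $2W$ are assumptions for illustration, not part of the notes):

```python
import numpy as np

# Toy loss L(W) = ||W||^2 with analytic gradient 2W (hypothetical example).
def loss(W):
    return np.sum(W ** 2)

def grad(W):
    return 2.0 * W

lr = 0.1                       # the headache-inducing hyperparameter
W = np.array([3.0, -2.0])      # initial weights
for _ in range(100):
    W = W - lr * grad(W)       # W' = W - lr * grad_W L

print(loss(W))                 # loss shrinks toward 0
```

With $lr$ too large the iterates would overshoot and diverge; too small and convergence crawls, which is why this hyperparameter gets tuned first.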
Batch Gradient Descent (BGD)
All samples are used in each iteration to update the parameters. The gradient computed over the full dataset best represents the sample population and therefore points most accurately toward a minimum. However, when there are many samples, each update becomes expensive (one epoch = one iteration = one parameter update).
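A minimal BGD sketch, assuming a toy noiseless linear least-squares problem (the data, dimensions, and learning rate are illustrative choices): every sample enters each gradient, so one epoch is exactly one update.

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # the FULL dataset
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
lr = 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(X)    # mean gradient over ALL samples
    w -= lr * grad                       # one epoch = one parameter update
```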
Mini-batch Gradient Descent (MBGD, often loosely called SGD)
- hyperparameter: batch size
- usually based on memory constraints
- use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2
- works because examples in the training data are correlated, so the gradient over a small batch is a good estimate of the full gradient
- compute the gradient over batches of the training data
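The points above can be sketched as follows, again on an assumed toy least-squares problem; the batch size of 32 is a power of 2 as the notes recommend:

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true

w = np.zeros(4)
lr, batch_size = 0.05, 32                    # batch size: a power of 2
for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]                  # sample a mini-batch
    grad = Xb.T @ (Xb @ w - yb) / batch_size # gradient over the batch only
    w -= lr * grad
```

Each update is cheaper than a full-batch pass, and because samples are correlated, the batch gradient still points roughly toward the minimum.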
Stochastic Gradient Descent (SGD) / On-line Gradient Descent
- each update is a Monte Carlo estimate of the full-batch gradient
- batch size=1
- more sensitive to noise
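With batch size 1, the sketch degenerates to one randomly drawn sample per update (same assumed toy problem as before); the gradient estimate is unbiased but noisy:

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(512, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true

w = np.zeros(2)
lr = 0.02
for step in range(5000):
    i = rng.integers(len(X))          # draw ONE sample at random
    g = (X[i] @ w - y[i]) * X[i]      # noisy single-sample gradient
    w -= lr * g
```

The per-step noise is why on-line SGD typically needs a smaller learning rate (or a decaying one) than mini-batch methods.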
Numerical gradient check
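A gradient check compares the analytic gradient against the centered difference $[f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}-h\mathbf{e}_i)] / 2h$, one dimension at a time. A minimal sketch, assuming a toy function $f(x) = \sum_i x_i^3$ (not from the notes):

```python
import numpy as np

# Toy function and its analytic gradient (hypothetical example).
def f(x):
    return np.sum(x ** 3)

def analytic_grad(x):
    return 3.0 * x ** 2

def numerical_grad(f, x, h=1e-5):
    """Centered difference, iterating over all dimensions one by one."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 0.5])
ga, gn = analytic_grad(x), numerical_grad(f, x)
rel_err = np.abs(ga - gn) / (np.abs(ga) + np.abs(gn) + 1e-12)
print(rel_err.max())   # should be tiny for a correct implementation
```

A common convention is to look at this relative error rather than the absolute difference, since gradient components can differ in scale by orders of magnitude.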