Optimization: Gradient Descent
- hyperparameter: $lr$ [most important and most headache-inducing]
- analogy: a blindfolded hiker who repeatedly steps in the steepest downhill direction they can feel underfoot
- $W' = W - lr \cdot \nabla_W L$
- numerical gradient: $[f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}-h\mathbf{e}_i)] / 2h$ for each dimension $i$ (centered difference); approximate and slow but easy to write, iterating over all dimensions one by one
- analytic gradient: fast, exact
- in practice: always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
- in practice, the analytic gradient is computed by backpropagation
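The update rule above can be sketched on a toy loss (the quadratic $L(W) = \|W\|^2$ and its analytic gradient $2W$ are assumptions for illustration, not part of the notes):

```python
import numpy as np

# Toy loss L(W) = ||W||^2 with analytic gradient 2W (hypothetical example).
def loss(W):
    return np.sum(W ** 2)

def grad(W):
    return 2.0 * W

lr = 0.1                       # the headache-inducing hyperparameter
W = np.array([3.0, -2.0])      # initial weights
for _ in range(100):
    W = W - lr * grad(W)       # W' = W - lr * grad_W L

print(loss(W))                 # loss shrinks toward 0
```

With $lr$ too large the iterates would overshoot and diverge; too small and convergence crawls, which is why this hyperparameter gets tuned first.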
Batch Gradient Descent (BGD)
All samples are used in each iteration to update the parameters. The gradient computed over the full dataset best represents the sample population and therefore points most accurately toward a minimum. However, when there are many samples, each update becomes expensive (one epoch = one iteration = one parameter update).
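A minimal BGD sketch, assuming a toy noiseless linear least-squares problem (the data, dimensions, and learning rate are illustrative choices): every sample enters each gradient, so one epoch is exactly one update.

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))            # the FULL dataset
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
lr = 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(X)    # mean gradient over ALL samples
    w -= lr * grad                       # one epoch = one parameter update
```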
Mini-batch Gradient Descent (MBGD, often loosely called SGD)
- hyperparameter: batch size
- usually based on memory constraints
- use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2
- works because examples in the training data are correlated, so the gradient over a small batch is a good estimate of the full gradient
- compute the gradient over batches of the training data
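The points above can be sketched as follows, again on an assumed toy least-squares problem; the batch size of 32 is a power of 2 as the notes recommend:

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(1024, 4))
w_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ w_true

w = np.zeros(4)
lr, batch_size = 0.05, 32                    # batch size: a power of 2
for step in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]                  # sample a mini-batch
    grad = Xb.T @ (Xb @ w - yb) / batch_size # gradient over the batch only
    w -= lr * grad
```

Each update is cheaper than a full-batch pass, and because samples are correlated, the batch gradient still points roughly toward the minimum.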
Stochastic Gradient Descent (SGD) / On-line Gradient Descent
- each update is a Monte Carlo estimate of the full-batch gradient
- batch size=1
- more sensitive to noise
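With batch size 1, the sketch degenerates to one randomly drawn sample per update (same assumed toy problem as before); the gradient estimate is unbiased but noisy:

```python
import numpy as np

# Hypothetical toy dataset: y = X @ w_true, no noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(512, 2))
w_true = np.array([1.5, -0.5])
y = X @ w_true

w = np.zeros(2)
lr = 0.02
for step in range(5000):
    i = rng.integers(len(X))          # draw ONE sample at random
    g = (X[i] @ w - y[i]) * X[i]      # noisy single-sample gradient
    w -= lr * g
```

The per-step noise is why on-line SGD typically needs a smaller learning rate (or a decaying one) than mini-batch methods.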
Numerical gradient check
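A gradient check compares the analytic gradient against the centered difference $[f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}-h\mathbf{e}_i)] / 2h$, one dimension at a time. A minimal sketch, assuming a toy function $f(x) = \sum_i x_i^3$ (not from the notes):

```python
import numpy as np

# Toy function and its analytic gradient (hypothetical example).
def f(x):
    return np.sum(x ** 3)

def analytic_grad(x):
    return 3.0 * x ** 2

def numerical_grad(f, x, h=1e-5):
    """Centered difference, iterating over all dimensions one by one."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([1.0, -2.0, 0.5])
ga, gn = analytic_grad(x), numerical_grad(f, x)
rel_err = np.abs(ga - gn) / (np.abs(ga) + np.abs(gn) + 1e-12)
print(rel_err.max())   # should be tiny for a correct implementation
```

A common convention is to look at this relative error rather than the absolute difference, since gradient components can differ in scale by orders of magnitude.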