Optimization: Gradient Descent

Batch Gradient Descent (BGD)

All samples are used in each iteration to compute the gradient and update the parameters to minimize the loss function. The direction computed from the full dataset better represents the sample population and therefore points more accurately toward the minimum. However, when there are many samples, each update becomes expensive (one epoch = one iteration = one parameter update).
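
A minimal NumPy sketch of BGD on a toy least-squares problem (the data X, y, the learning rate lr, and the iteration count are illustrative assumptions, not from the source):

```python
import numpy as np

# Hypothetical toy data for linear regression y ~= X @ w (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.1

for epoch in range(100):
    # Full-batch gradient of the mean squared error: uses ALL samples,
    # so one epoch corresponds to exactly one parameter update.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad

print(w)  # should approach true_w
```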

Mini-batch Gradient Descent (MBGD, often loosely called SGD)

A compromise between BGD and SGD: each iteration uses a small random subset (mini-batch) of the samples to compute the gradient and update the parameters. Updates stay cheap while the averaged gradient is far less noisy than a single-sample estimate; this is the variant that deep-learning practice usually (loosely) calls SGD. A sketch follows.
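
A minimal NumPy sketch of MBGD on the same hypothetical least-squares setup (batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

# Same hypothetical toy data as the BGD sketch (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr, batch_size = 0.1, 16

for epoch in range(50):
    idx = rng.permutation(len(y))              # reshuffle samples each epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient estimated from the mini-batch only; many updates per epoch.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)
        w -= lr * grad
```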

Stochastic Gradient Descent (SGD) / On-line Gradient Descent

Only one sample is used per iteration to compute the gradient and update the parameters. Each update is very cheap, but the single-sample gradient is noisy, so the loss decreases only on average and fluctuates from step to step. In the on-line setting, samples arrive one at a time and each triggers an immediate update. A sketch follows.
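
A minimal NumPy sketch of single-sample SGD on the same hypothetical setup (learning rate and epoch count are illustrative assumptions):

```python
import numpy as np

# Same hypothetical toy data as above (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.05

for epoch in range(20):
    for i in rng.permutation(len(y)):       # visit samples in random order
        xi, yi = X[i], y[i]
        grad = 2 * (xi @ w - yi) * xi       # noisy gradient from a single sample
        w -= lr * grad                      # one parameter update per sample
```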

Numerical gradient check

To verify an analytically derived gradient (e.g. from backpropagation), compare it against a finite-difference approximation such as the central difference (f(w + εe_i) − f(w − εe_i)) / (2ε) for each coordinate i. With ε ≈ 1e-5 in double precision, the relative error against a correct analytic gradient should be tiny (around 1e-7).
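
A minimal NumPy sketch of a gradient check on the least-squares loss used above (the data and the tolerance are illustrative assumptions):

```python
import numpy as np

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def analytic_grad(w, X, y):
    # Closed-form gradient of the mean squared error.
    return 2 * X.T @ (X @ w - y) / len(y)

def numerical_grad(f, w, eps=1e-5):
    # Central difference per coordinate: (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps).
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)

ga = analytic_grad(w, X, y)
gn = numerical_grad(lambda v: loss(v, X, y), w)
# Relative error should be on the order of 1e-7 if the analytic gradient is correct.
print(np.max(np.abs(ga - gn) / (np.abs(ga) + np.abs(gn) + 1e-12)))
```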