-Adwita Singh
The article’s contents and algorithm are based on the paper: https://arxiv.org/abs/1412.6980.
Okay, so we all know of the Unofficial PyTorch Optimization Loop Song by Daniel Burke, don’t we? (https://www.youtube.com/watch?v=Nutpusq_AFw)
And I’m pretty sure we all have some basic idea about how loss is calculated in a standard neural network (and why it is needed) (TL;DR: to check how far our predictions are from true values).
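As a quick refresher, here is a minimal sketch of one common loss, mean squared error, in plain Python (the function name and sample values are just for illustration; in PyTorch you would typically use torch.nn.MSELoss):

```python
# Mean squared error: the average of the squared differences
# between predictions and true values — a standard way to
# measure "how far our predictions are from true values".
def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

preds = [2.5, 0.0, 2.0]
truth = [3.0, -0.5, 2.0]
print(mse(preds, truth))  # ≈ 0.1667
```

The larger this number, the worse the model's predictions are on average.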
But once the loss has been computed and gradients have been calculated via the backward pass, how do we actually adjust our parameters (in our case, weights and biases) so that the next forward pass produces a lower loss value, and therefore predictions that are closer to the true values in our data?
Too much to unpack? Let’s understand with an example.
The thing with a loss function is that it only tells us how bad our model is, and not exactly how to improve it (kinda like that one Math teacher we all had).
We begin with random weights when starting with the training of our model. These become the parameters for our model, and are represented by θ.
During the backward pass, we compute the gradient so we can adjust the values of these parameters for our model. These gradients are stored in each parameter’s .grad attribute.
Now, the optimizer's job is to then use those gradients to update the parameters, according to its specific algorithm.
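This update step can be sketched in a few lines of plain Python. The toy loss L(θ) = (θ − 3)² and its hand-derived gradient are just illustrative assumptions; PyTorch's optimizers perform the same kind of update using each parameter's .grad attribute:

```python
# Gradient of the toy loss L(θ) = (θ − 3)², which is 2(θ − 3).
# This plays the role of the value stored in a parameter's .grad.
def grad(theta):
    return 2 * (theta - 3.0)

theta = 0.0   # our starting (random-ish) parameter
eta = 0.1     # learning rate

for _ in range(100):
    # The basic gradient-descent update: step opposite the gradient.
    theta = theta - eta * grad(theta)

print(theta)  # moves toward the minimum at θ = 3
```

Each iteration nudges θ in the direction that decreases the loss, which is exactly what an optimizer does across all of a model's parameters at once.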
We have optimizers like Stochastic Gradient Descent (SGD), which use a fixed learning rate for every update to θ. The SGD update rule can be written as:
θ ← θ − η ⋅ ∇θ L(θ)
Where: