Gradient Descent

Univariate Gradient Descent

Gradient descent is a commonly used algorithm aims at optimizing the variables in the machine learning model. Consider we have a nonlinear relation between the target $\mathbf{y}$ and the feature $\mathbf{y}=\frac{1}{1000}(10\mathbf{x})^3 + \frac{1}{20}(10\mathbf{x})^2+\frac{1}{100}10\mathbf{x}=\mathbf{x}^3+5\mathbf{x}^2+\frac{1}{10}\mathbf{x}$. Suppose we have some knowledge on the model and we know the coefficients of the exponenetial term. Now we would like to construct a machine learning model to fit the above-mentioned function: $\mathbf{\hat{y}}=\frac{1}{1000}(w\mathbf{x})^3 + \frac{1}{20}(w\mathbf{x})^2 + \frac{1}{100}(w\mathbf{x})$. It is trivial to see that $w=10$ gives us the best approximation of $\mathbf{y}$. Here, we will use a gradient descent algorithm to approximate this solution.

First, we need to define the objective function between $\mathbf{y}$ and $\mathbf{\hat{y}}$, let's use mean square error here:

$L(\mathbf{y}, \mathbf{\hat{y}})=\sum\limits_{i=1}^{m}(y_i - \hat{y}i)^2=\sum\limits{i=1}^{m}[y_i - (\frac{w^3x_{i}^3}{1000} + \frac{w^2x_{i}^2}{20} + \frac{wx_{i}}{100})]^2$

We can then compute the gradient of $L$ with respect to $w$ (see Objective Functions in Neural Network Basics ):

$\nabla_{\mathbf{w}}L=2(\frac{3w^2\mathbf{x}^3}{1000} + \frac{w\mathbf{x}^2}{10} + \frac{\mathbf{x}}{100})^T(\mathbf{\hat{y}} - \mathbf{y})$

The derived gradient will always point towards the opposite direction from the weight that yield local minimum of the loss surface. Therefore, we can use this property to update the weight to make sure it is closer to the local minimum:

$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)} - \eta\nabla_{\mathbf{w}}L(\mathbf{\hat{y}})$

where $\eta$ is a hyperparameter called learning rate, which is used to control the magnitude of each updates.

eta = 0.5
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x
w = tf.Variable(0.5)

mse = tf.keras.losses.MeanSquaredError()

for i in range(0, 10):
    with tf.GradientTape() as g:
        g.watch(w)
        y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
        loss = mse(y, y_hat)
    gradient = g.gradient(loss, w)
    w = w.assign(w - eta * gradient)

Multivariate Gradient Descent

Now let's look into multiple variables regression. Given a nonlinear transformation of $\mathbf{X}$ involving two variables: $\mathbf{y}=ReLU(5.0\mathbf{x}_1 + 5.0\mathbf{x}_2)$, we would like to construct a predictor to approximate this function: $\hat{y}=ReLU(w_1\mathbf{x}_1+w_2\mathbf{x}_2)$:

$\mathbf{\hat{y}} = ReLU(\mathbf{X}\mathbf{w}) = ReLU(\begin{bmatrix} x_{1,1} & x_{1,2}\\ x_{2,1} & x_{2,2}\\ \vdots & \vdots \\ x_{m,1} & x_{m,2}\\\end{bmatrix}\begin{bmatrix} w_{1} \\ w_{2} \\\end{bmatrix})$

The squared loss function then become:

$L=\frac{1}{m}\sum\limits_{i=1}^{m}(y_i - \hat{y}i)^2=\frac{1}{m}\sum\limits{i=1}^{m}[y_i - ReLU(\sum\limits_{j=1}^{2}w_jx_{ij})]^2$

The gradient of $L$ with respect to $\mathbf{w}$ is:

$\nabla_{\mathbf{w}}L(\mathbf{\hat{y}})=2\frac{\partial(\text{ReLU}(\mathbf{X}))}{\partial\mathbf{w}}(\mathbf{\hat{y}}-\mathbf{y})$

Although we only show the example of two dimensional feature matrix here for simplicity. The algorithm can be easily generalized to higher dimension.

eta = 1
num_samples = 10

tf.random.set_seed(0)
x = tf.random.normal((num_samples, 2), mean=0, stddev=1)
w = tf.Variable([[8.], [10.]])
y = activations.relu(tf.matmul(x, w))
w = tf.Variable([[-15.], [-15.]])

mse = tf.keras.losses.MeanSquaredError()

for i in range(0, 10):
    with tf.GradientTape() as g:
        g.watch(w)
        y_hat = activations.relu(tf.matmul(x, w))
        loss = mse(y, y_hat)
    gradient = g.gradient(loss, w)
    w = w.assign(w - eta * gradient)

Epoch and Batch

When we compute the loss and the gradient, we can either consider every sample or a subset of samples at a time. We call the iteration that every sample in the dataset has been used to update the weight an epoch. In each epoch, we can split the samples into different batches. Note that it is important to shuffle the order of samples in each epoch.

Mini-batch gradient descent: updates the weights consider a batch of samples at a time.
Stochastic gradient descent (SGD): update the weights consider one sample at a time.

Instability

The gradient descent algorithm might not be able to find the optimal weights that globally minimizes the loss function for many reasons. For instance, if the learning rate is set to be too large, the updated process can be unstable and might lead to non-optimal approximation. This problem can be identified by inspecting the training loss in each epoch:

eta = 1.5
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x

w = tf.Variable(0.5)

for i in range(0, 10):
    with tf.GradientTape() as g:
        g.watch(w)
        y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
        loss = tf.reduce_mean(tf.pow(y - y_hat, 2))
    gradient = g.gradient(loss, w)
    w = w.assign(w - eta * gradient) 
    print(f'iteration:{i+1:>3}{"":>4}weight:{w.numpy():>7.3f}{"":>4}loss:{loss.numpy():>7.3f}')

iteration:  1    weight:  1.515    loss: 36.317
iteration:  2    weight:  4.452    loss: 34.980
iteration:  3    weight: 12.023    loss: 24.310
iteration:  4    weight: -2.383    loss:  8.710
iteration:  5    weight: -6.264    loss: 33.324
iteration:  6    weight:-12.476    loss: 18.812
iteration:  7    weight: -2.094    loss: 16.372
iteration:  8    weight: -5.565    loss: 34.032
iteration:  9    weight:-11.888    loss: 21.743
iteration: 10    weight: -4.581    loss: 12.915

However, if the learning rate is set to be too small. The learning process becomes extremely slow and might be stuck in the local minimum or saddle points. Therefore, it is always important to keep track of the loss during gradient descent. The difficulties described here is usually addressed using proper weight initialization or applying the optimizers, which we will explain in more details in other sections:

eta = 0.2
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x

w = tf.Variable(-5.)

for i in range(0, 10):
    with tf.GradientTape() as g:
        g.watch(w)
        y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
        loss = tf.reduce_mean(tf.pow(y - y_hat, 2))
    gradient = g.gradient(loss, w)
    w = w.assign(w - eta * gradient) 
    print(f'iteration:{i+1:>3}{"":>4}weight:{w.numpy():>7.3f}{"":>4}loss:{loss.numpy():>7.3f}')

iteration:  1    weight: -5.829    loss: 24.109
iteration:  2    weight: -6.671    loss: 20.628
iteration:  3    weight: -7.473    loss: 17.149
iteration:  4    weight: -8.184    loss: 14.095
iteration:  5    weight: -8.766    loss: 11.784
iteration:  6    weight: -9.206    loss: 10.289
iteration:  7    weight: -9.516    loss:  9.460
iteration:  8    weight: -9.721    loss:  9.060
iteration:  9    weight: -9.852    loss:  8.887
iteration: 10    weight: -9.932    loss:  8.818