Gradient descent is a commonly used algorithm aims at optimizing the variables in the machine learning model. Consider we have a nonlinear relation between the target $\mathbf{y}$ and the feature $\mathbf{y}=\frac{1}{1000}(10\mathbf{x})^3 + \frac{1}{20}(10\mathbf{x})^2+\frac{1}{100}10\mathbf{x}=\mathbf{x}^3+5\mathbf{x}^2+\frac{1}{10}\mathbf{x}$. Suppose we have some knowledge on the model and we know the coefficients of the exponenetial term. Now we would like to construct a machine learning model to fit the above-mentioned function: $\mathbf{\hat{y}}=\frac{1}{1000}(w\mathbf{x})^3 + \frac{1}{20}(w\mathbf{x})^2 + \frac{1}{100}(w\mathbf{x})$. It is trivial to see that $w=10$ gives us the best approximation of $\mathbf{y}$. Here, we will use a gradient descent algorithm to approximate this solution.
First, we need to define the objective function between $\mathbf{y}$ and $\mathbf{\hat{y}}$, let's use mean square error here:
$L(\mathbf{y}, \mathbf{\hat{y}})=\sum\limits_{i=1}^{m}(y_i - \hat{y}i)^2=\sum\limits{i=1}^{m}[y_i - (\frac{w^3x_{i}^3}{1000} + \frac{w^2x_{i}^2}{20} + \frac{wx_{i}}{100})]^2$
We can then compute the gradient of $L$ with respect to $w$ (see Objective Functions in Neural Network Basics ):
$\nabla_{\mathbf{w}}L=2(\frac{3w^2\mathbf{x}^3}{1000} + \frac{w\mathbf{x}^2}{10} + \frac{\mathbf{x}}{100})^T(\mathbf{\hat{y}} - \mathbf{y})$
The derived gradient will always point towards the opposite direction from the weight that yield local minimum of the loss surface. Therefore, we can use this property to update the weight to make sure it is closer to the local minimum:
$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)} - \eta\nabla_{\mathbf{w}}L(\mathbf{\hat{y}})$
where $\eta$ is a hyperparameter called learning rate, which is used to control the magnitude of each updates.
eta = 0.5
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x
w = tf.Variable(0.5)
mse = tf.keras.losses.MeanSquaredError()
for i in range(0, 10):
with tf.GradientTape() as g:
g.watch(w)
y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
loss = mse(y, y_hat)
gradient = g.gradient(loss, w)
w = w.assign(w - eta * gradient)

Now let's look into multiple variables regression. Given a nonlinear transformation of $\mathbf{X}$ involving two variables: $\mathbf{y}=ReLU(5.0\mathbf{x}_1 + 5.0\mathbf{x}_2)$, we would like to construct a predictor to approximate this function: $\hat{y}=ReLU(w_1\mathbf{x}_1+w_2\mathbf{x}_2)$:
$\mathbf{\hat{y}} = ReLU(\mathbf{X}\mathbf{w}) = ReLU(\begin{bmatrix} x_{1,1} & x_{1,2}\\ x_{2,1} & x_{2,2}\\ \vdots & \vdots \\ x_{m,1} & x_{m,2}\\\end{bmatrix}\begin{bmatrix} w_{1} \\ w_{2} \\\end{bmatrix})$
The squared loss function then become:
$L=\frac{1}{m}\sum\limits_{i=1}^{m}(y_i - \hat{y}i)^2=\frac{1}{m}\sum\limits{i=1}^{m}[y_i - ReLU(\sum\limits_{j=1}^{2}w_jx_{ij})]^2$
The gradient of $L$ with respect to $\mathbf{w}$ is:
$\nabla_{\mathbf{w}}L(\mathbf{\hat{y}})=2\frac{\partial(\text{ReLU}(\mathbf{X}))}{\partial\mathbf{w}}(\mathbf{\hat{y}}-\mathbf{y})$
Although we only show the example of two dimensional feature matrix here for simplicity. The algorithm can be easily generalized to higher dimension.
eta = 1
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 2), mean=0, stddev=1)
w = tf.Variable([[8.], [10.]])
y = activations.relu(tf.matmul(x, w))
w = tf.Variable([[-15.], [-15.]])
mse = tf.keras.losses.MeanSquaredError()
for i in range(0, 10):
with tf.GradientTape() as g:
g.watch(w)
y_hat = activations.relu(tf.matmul(x, w))
loss = mse(y, y_hat)
gradient = g.gradient(loss, w)
w = w.assign(w - eta * gradient)

When we compute the loss and the gradient, we can either consider every sample or a subset of samples at a time. We call the iteration that every sample in the dataset has been used to update the weight an epoch. In each epoch, we can split the samples into different batches. Note that it is important to shuffle the order of samples in each epoch.

The gradient descent algorithm might not be able to find the optimal weights that globally minimizes the loss function for many reasons. For instance, if the learning rate is set to be too large, the updated process can be unstable and might lead to non-optimal approximation. This problem can be identified by inspecting the training loss in each epoch:
eta = 1.5
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x
w = tf.Variable(0.5)
for i in range(0, 10):
with tf.GradientTape() as g:
g.watch(w)
y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
loss = tf.reduce_mean(tf.pow(y - y_hat, 2))
gradient = g.gradient(loss, w)
w = w.assign(w - eta * gradient)
print(f'iteration:{i+1:>3}{"":>4}weight:{w.numpy():>7.3f}{"":>4}loss:{loss.numpy():>7.3f}')
iteration: 1 weight: 1.515 loss: 36.317
iteration: 2 weight: 4.452 loss: 34.980
iteration: 3 weight: 12.023 loss: 24.310
iteration: 4 weight: -2.383 loss: 8.710
iteration: 5 weight: -6.264 loss: 33.324
iteration: 6 weight:-12.476 loss: 18.812
iteration: 7 weight: -2.094 loss: 16.372
iteration: 8 weight: -5.565 loss: 34.032
iteration: 9 weight:-11.888 loss: 21.743
iteration: 10 weight: -4.581 loss: 12.915

However, if the learning rate is set to be too small. The learning process becomes extremely slow and might be stuck in the local minimum or saddle points. Therefore, it is always important to keep track of the loss during gradient descent. The difficulties described here is usually addressed using proper weight initialization or applying the optimizers, which we will explain in more details in other sections:
eta = 0.2
num_samples = 10
tf.random.set_seed(0)
x = tf.random.normal((num_samples, 1), mean=0, stddev=1)
y = x ** 3 + 5 * x ** 2 + 0.1 * x
w = tf.Variable(-5.)
for i in range(0, 10):
with tf.GradientTape() as g:
g.watch(w)
y_hat = ((w * x) ** 3 + 50 * (w * x) ** 2 + 10 * w * x) / 1000
loss = tf.reduce_mean(tf.pow(y - y_hat, 2))
gradient = g.gradient(loss, w)
w = w.assign(w - eta * gradient)
print(f'iteration:{i+1:>3}{"":>4}weight:{w.numpy():>7.3f}{"":>4}loss:{loss.numpy():>7.3f}')
iteration: 1 weight: -5.829 loss: 24.109
iteration: 2 weight: -6.671 loss: 20.628
iteration: 3 weight: -7.473 loss: 17.149
iteration: 4 weight: -8.184 loss: 14.095
iteration: 5 weight: -8.766 loss: 11.784
iteration: 6 weight: -9.206 loss: 10.289
iteration: 7 weight: -9.516 loss: 9.460
iteration: 8 weight: -9.721 loss: 9.060
iteration: 9 weight: -9.852 loss: 8.887
iteration: 10 weight: -9.932 loss: 8.818
