Training a neuron requires that we take the derivative of our loss or “cost” function with respect to the parameters of our model, $\vec{w}$ and $b$. In order to train, we need data. The data come in two flavors: independent ($X$) and dependent ($\vec{y}$), the latter also known as targets/labels.
When we train a neural network we train on many data points at once, each in the form of a vector, e.g. a vector encoding an image. Therefore we need to define a new $X$: a matrix built as a vector of column vectors.
$$ X = [\vec{x}_1,\vec{x}_2,\vec{x}_3,...,\vec{x}_N]^T $$
So each $\vec{x}_i$ inside $X$ is a column vector representing an image, and $X$ holds $N$ images, i.e. $|X| = N$.
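As a minimal sketch of this bookkeeping (assuming NumPy, and a made-up image size $d$; the shapes are the only point here), stacking the per-image vectors as rows gives the $N \times d$ matrix $X$:

```python
import numpy as np

# Hypothetical sizes: N images, each flattened into a d-dimensional vector.
N, d = 5, 784

# Each x_i is one image as a vector; stacking them as rows gives X = [x_1, ..., x_N]^T.
x_list = [np.random.rand(d) for _ in range(N)]
X = np.stack(x_list)   # shape (N, d): row i of X is x_i

print(X.shape)         # (5, 784)
```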
Now, let’s define the vector of targets. Say each image has one single label, e.g. the image $\vec{x}_i$ is a cat; “cat” here is its label or target.
$$ \vec{y} = [target(\vec{x}_1), target(\vec{x}_2), target(\vec{x}_3), ..., target(\vec{x}_N)] $$
We have exactly as many targets as images, so $|\vec{y}| = N$, and each $y_i$ is a scalar here (think of the name “cat” encoded as a number).
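In code, a string label like “cat” is usually encoded as a number so that each $y_i$ really is a scalar. A small sketch continuing from the snippet above (the mapping cat → 1.0, dog → 0.0 is an arbitrary choice for illustration):

```python
# Arbitrary numeric encoding of the string labels (an assumption for illustration).
label_to_target = {"cat": 1.0, "dog": 0.0}

labels = ["cat", "dog", "cat", "cat", "dog"]   # one label per image
y = np.array([label_to_target[name] for name in labels])

print(y.shape)   # (5,) -- one scalar target per image, so |y| = N
```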
The cost function is what we minimize when we train the network, as mentioned above. There are some common cost functions that are known to work well, but you can also define your own special cost function (e.g. one based on a differential equation) if needed. In our example, we will stick to something popular: the mean squared error (MSE).
$$ C(\vec{w}, b, X, \vec{y}) = \frac{1}{N}\sum_{i=1}^N(y_i - activation(\vec{x}_i))^2 = \frac{1}{N}\sum_{i=1}^N(y_i - max(0,\vec{w}\cdot \vec{x}_i + b))^2 $$
As the name suggests, this is the mean of the squared difference between the actual label ($y_i$) and the predicted label ($\hat{y}_i = activation(\vec{x}_i)$) over all the data, e.g. all the images.
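Translating the MSE formula directly into code makes the pieces concrete. This sketch continues from the snippets above, with $\vec{w}$ and $b$ as arbitrary initial parameters (not trained values):

```python
# Arbitrary initial parameters for the single neuron.
w = np.random.rand(d)
b = 0.0

def cost(w, b, X, y):
    """Mean squared error between the targets and the ReLU neuron's predictions."""
    y_hat = np.maximum(0.0, X @ w + b)   # activation(x_i) for every row of X at once
    return np.mean((y - y_hat) ** 2)

print(cost(w, b, X, y))
```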
Since we will be taking the derivative of this w.r.t. $\vec{w}$ and $b$, we will need to define some intermediate variables/functions.
$$ u(\vec{w}, \vec{x}, b) = max(0, \vec{w}\cdot \vec{x} + b) $$
$$ v(y, u) = y - u $$
$$ C(v) = \frac{1}{N}\sum_{i=1}^N v_i^2 $$

where $v_i = v(y_i, u(\vec{w}, \vec{x}_i, b))$ is the residual for the $i$-th data point.
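Written as code, these intermediate functions compose back into exactly the same cost, which is what will make applying the chain rule mechanical. A sketch continuing from the snippets above:

```python
def u(w, x, b):
    """Pre-activation through the ReLU: max(0, w . x + b)."""
    return max(0.0, np.dot(w, x) + b)

def v(y_i, u_i):
    """Residual between the target and the prediction."""
    return y_i - u_i

# C as the mean of the squared residuals, assembled from u and v one point at a time.
C = np.mean([v(y[i], u(w, X[i], b)) ** 2 for i in range(N)])

print(C)   # matches cost(w, b, X, y) from the previous snippet
```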