Training neural networks these days is mostly a matter of learning to use tools like PyTorch and TensorFlow. Jeremy's courses show how to become a world-class deep learning practitioner with only a minimal level of scalar calculus, thanks to the automatic differentiation built into modern deep learning libraries.
But if you really want to understand what’s going on under the hood of these libraries, and to grok academic papers discussing the latest advances in model training techniques, you’ll need to understand certain bits of the field of matrix calculus.
For example, the activation of a neuron in a neural network layer is typically calculated using the dot product of an edge-weight vector (w) and an input vector (x), plus a scalar **bias** (b). So, if we call the activation of a neuron in a neural network layer z, we get:
$$ z(\vec{x}) = \sum_{i=1}^{n} w_ix_i + b = \mathbf{w}\cdot\mathbf{x} + b $$
$z(\vec{x})$ is called the unit’s affine function. It is usually passed through a rectified linear unit (ReLU), which introduces non-linearity by clipping negative values to zero: $\max(0, z(\vec{x}))$.
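The affine function and ReLU above can be sketched in a few lines of NumPy (the particular weight, input, and bias values here are just illustrative):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])  # edge weights for one unit
x = np.array([1.0, 2.0, 3.0])   # input vector
b = 0.25                        # scalar bias

z = np.dot(w, x) + b            # affine function: w . x + b
activation = max(0.0, z)        # ReLU clips negative values to zero
print(z, activation)            # 4.75 4.75
```

For this input, z is positive, so the ReLU passes it through unchanged; a negative z would come out as 0.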

A NN layer is composed of many of these units. The activations of one layer’s units become the inputs of the next layer, and so on, until the activation of the last layer, which is called the network output.
Training this neuron means choosing weights w and bias b so that we get the desired output for all N inputs x. To do that, we minimize a loss function that compares the network’s final activation(x) with the target(x) (the desired output for x) across all input vectors x. To minimize the loss, we use some variation of gradient descent, such as plain stochastic gradient descent (SGD), SGD with momentum, or Adam. All of these require the partial derivatives (the gradient) of activation(x) with respect to the model parameters w and b. Our goal is to gradually tweak w and b so that the overall loss function keeps getting smaller across all x inputs.
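To make the loop concrete, here is a minimal sketch of gradient descent for a single linear unit with an MSE loss, using hand-derived gradients rather than automatic differentiation. All names and data here are illustrative assumptions, not from any library, and for simplicity the unit is trained without the ReLU:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # N=8 input vectors, 3 features each
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 3.0                   # targets from a known affine map

w = np.zeros(3)                        # parameters to learn
b = 0.0
lr = 0.1                               # learning rate

for _ in range(500):
    z = X @ w + b                      # affine outputs for all inputs
    err = z - y                        # per-example error
    loss = np.mean(err ** 2)           # MSE loss
    grad_w = 2.0 * X.T @ err / len(X)  # dLoss/dw, derived by hand
    grad_b = 2.0 * np.mean(err)        # dLoss/db
    w -= lr * grad_w                   # step downhill on the loss surface
    b -= lr * grad_b
```

Each iteration nudges w and b against the gradient, so the loss shrinks toward zero and the parameters approach the values that generated the targets. The matrix-calculus rules derived later in the article are what justify gradient expressions like `2 * X.T @ err / N` in the vector case.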
This article walks through the derivation of some important rules for computing partial derivatives with respect to vectors, particularly those useful for training neural networks; this field is known as matrix calculus. It is fairly easy to derive the derivative of the scalar version of a common loss function, Mean Squared Error (MSE), but a NN almost always has multiple inputs and (potentially) multiple network outputs. So we really need general rules for the derivative of a function with respect to a vector, and even rules for the derivative of a vector-valued function with respect to a vector.
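As a warm-up, the scalar case mentioned above can be written out directly. For a single squared-error term with scalar weight $w$, input $x$, bias $b$, and target $y$ (symbols here are illustrative), the chain rule gives:

$$ \frac{\partial}{\partial w}\big(wx + b - y\big)^2 = 2\,(wx + b - y)\,x, \qquad \frac{\partial}{\partial b}\big(wx + b - y\big)^2 = 2\,(wx + b - y) $$

The rest of the article is about generalizing exactly this kind of computation to the case where $\mathbf{w}$ and $\mathbf{x}$ are vectors.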