<aside> 💡

These are rough notes for my PhD course on automatic differentiation. On the website you can find accompanying slides and notebooks. The reference book is [BR24], and most of the material below is adapted from there.

</aside>

# Derivatives and gradients

## Differentiation for vector-valued functions

<aside> 💡

Chapter 6 in my book describes automatic differentiation for the case of vector-valued functions $f : \mathbb{R}^p \rightarrow \mathbb{R}^m$. We recall the basic concepts here, although we assume the material is already known.

</aside>

We can define the Jacobian matrix of $f$ as the $m \times p$ matrix of partial derivatives (one row per output, one column per input):

$$ \partial_{ij} f(x) = \frac{\partial f_i (x)}{\partial x_j} $$

where $\frac{\partial f_i (x)}{\partial x_j}$ is the partial derivative of the $i$-th output w.r.t. the $j$-th input coordinate:

$$ \frac{\partial f_i(x)}{\partial x_j} = \lim_{h \rightarrow 0} \frac{f_i(x + he_j) - f_i(x)}{h} $$

We use $e_i$ to denote the $i$-th canonical basis vector (here in the input space $\mathbb{R}^p$):

$$ (e_i)_j=\begin{cases} 1 & \; \text{ if } i=j \\ 0 & \; \text{ otherwise} \end{cases} $$
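As a quick sanity check of the definitions above, the partial derivatives can be approximated numerically with forward differences, building the $m \times p$ Jacobian one column at a time by perturbing along each basis vector $e_j$. A minimal sketch, assuming NumPy and a toy two-dimensional function of my own choosing (not from [BR24]):

```python
import numpy as np

def jacobian_fd(f, x, h=1e-6):
    """Approximate the m x p Jacobian of f at x with forward differences."""
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e_j = np.zeros(x.size)
        e_j[j] = 1.0                     # canonical basis vector e_j
        J[:, j] = (np.asarray(f(x + h * e_j)) - fx) / h
    return J

# Toy function f : R^2 -> R^2, f(x) = (x1 * x2, sin(x1)),
# whose exact Jacobian is [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
x = np.array([1.0, 2.0])
print(jacobian_fd(f, x))                 # close to [[2, 1], [cos(1), 0]]
```

With $h = 10^{-6}$ the forward-difference error is of order $h$, so the result matches the analytic Jacobian to several digits; this brute-force scheme needs $p$ extra evaluations of $f$, which is exactly what automatic differentiation avoids.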

When expressed in this form, the chain rule becomes matrix multiplication of the corresponding Jacobians:

$$ \partial (f \circ g)(x) = \partial f(h)\, \partial g(x) $$

where $h = g(x)$ and the terms $\partial f(h)$, $\partial g(x)$ are interpreted as matrices. For functions with a single output, the Jacobian matrix has a single row, and its transpose is called the gradient of the function, denoted $\nabla f(x)$. The directional derivative along a generic direction $u$ is then expressed as (ignore the notation on the left for now; it is only there to be consistent with the rest of the notes):

$$ \partial f(x)[u] = \langle \nabla f(x), u \rangle $$

where we use angle brackets to denote the inner product:

$$ \langle x, y \rangle = \sum_i x_iy_i = x^\top y \tag{1} $$
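To make the inner-product formula concrete, here is a small numerical check with a toy scalar function and direction of my own choosing: the directional derivative computed as $\langle \nabla f(x), u \rangle$ should agree with a forward-difference estimate along $u$.

```python
import numpy as np

# Toy scalar function f(x) = x1^2 + 3 * x2, with gradient [2 * x1, 3].
f = lambda x: x[0] ** 2 + 3 * x[1]
grad_f = lambda x: np.array([2 * x[0], 3.0])

x = np.array([1.0, 2.0])
u = np.array([0.5, -1.0])              # an arbitrary direction

# Directional derivative as the inner product <grad f(x), u> ...
dd_inner = grad_f(x) @ u

# ... and as a single forward-difference estimate along u.
eps = 1e-6
dd_fd = (f(x + eps * u) - f(x)) / eps
print(dd_inner, dd_fd)                 # both close to -2
```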

The equivalent concept for a vector-valued function is the Jacobian-vector product (JVP):

$$ \partial f(x)[u]= \partial f(x)u $$
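Note that, just like the directional derivative, the JVP can be approximated with a single finite difference along $u$, without ever materializing the full Jacobian. A minimal sketch, assuming NumPy and a toy vector-valued function of my own choosing:

```python
import numpy as np

# Toy function f : R^2 -> R^2, f(x) = (x1 * x2, x1 + x2^2),
# with exact Jacobian [[x2, x1], [1, 2 * x2]].
f = lambda x: np.array([x[0] * x[1], x[0] + x[1] ** 2])
J = lambda x: np.array([[x[1], x[0]], [1.0, 2 * x[1]]])

x = np.array([1.0, 2.0])
u = np.array([1.0, -1.0])

# JVP as an explicit matrix-vector product ...
jvp_exact = J(x) @ u

# ... and as one directional finite difference, no Jacobian needed.
eps = 1e-6
jvp_fd = (f(x + eps * u) - f(x)) / eps
print(jvp_exact, jvp_fd)               # both close to [1, -3]
```

This "Jacobian-free" view is the key point: forward-mode automatic differentiation computes exact JVPs at roughly the cost of one extra function evaluation.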

We can also consider the transpose of this operation, called the vector-Jacobian product (VJP), which we denote as follows (again, ignore the notation on the left for now; I swear it will make sense in a bit):

$$ \partial f(x)^*[v] := \left(\partial f(x) \right)^\top v $$
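As a small illustration (again with a toy Jacobian of my own choosing), the VJP is the transposed Jacobian applied to $v$, or equivalently the linear combination $\sum_i v_i \nabla f_i(x)$ of the Jacobian's rows:

```python
import numpy as np

# Toy Jacobian J(x) = [[x2, x1], [1, 2 * x2]] for f : R^2 -> R^2.
J = lambda x: np.array([[x[1], x[0]], [1.0, 2 * x[1]]])

x = np.array([1.0, 2.0])
v = np.array([1.0, 0.5])               # a vector in the output space R^2

# VJP: the transposed Jacobian applied to v, i.e. J(x)^T v.
vjp = J(x).T @ v

# Equivalently, a weighted combination of the Jacobian's rows,
# i.e. v_1 * grad f_1(x) + v_2 * grad f_2(x).
vjp_rows = v[0] * J(x)[0] + v[1] * J(x)[1]
print(vjp, vjp_rows)
```

Whereas a JVP pushes an input-space vector $u \in \mathbb{R}^p$ forward, a VJP pulls an output-space vector $v \in \mathbb{R}^m$ back; this is what reverse-mode automatic differentiation computes.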