In this lecture, we will learn about automatic differentiation (AD) in reverse mode, which is computationally more efficient than forward mode when the function has many input variables. We will use it to train a neural network to represent the molecular potential energy surface. This, along with the kinetic energy operator from the previous lecture, completes the quantum-mechanical Hamiltonian for describing vibrational motions in molecules.
Reverse-mode AD is a two-phase process. In the first phase, the original function code is executed, populating the intermediate variables $v_i$ and recording their dependencies in a computational graph (a book-keeping step). In the second phase, the derivatives are computed by propagating adjoints $\bar{v}_i$ backwards from the outputs to the inputs.
Consider the function $f(x_1,x_2) = \ln(x_1) + x_1x_2 - \sin(x_2)$. To evaluate it, we go through the following sequence of elementary operations:
$$ v_0 = x_1 \\ v_1 = x_2 \\ v_2 = \ln(v_0) \\ v_3 = v_0v_1 \\ v_4 = \sin(v_1) \\ v_5 = v_3 - v_4 \\ v_6 = v_2 + v_5 \\ f = v_6 $$
We can go forward and evaluate intermediate values $v_i$, for example, starting from $x_1=0.1$ and $x_2 = 0.5$:
$$ v_0 = 0.1 \\ v_1 = 0.5 \\ v_2 = \ln(0.1) = -2.30259 \\ v_3 = v_0v_1 = 0.1 \cdot 0.5 = 0.05 \\ v_4 = \sin(v_1) = \sin(0.5) = 0.47943 \\ v_5 = v_3 - v_4 = 0.05 - 0.47943 = -0.42943 \\ v_6 = v_2 + v_5 = -2.30259 - 0.42943 = -2.73202 \\ f = v_6 = -2.73202 $$
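As a minimal sketch of this forward (tracing) phase, written as plain straight-line Python that mirrors the decomposition above (the variable names simply follow the $v_i$):

```python
import numpy as np

# Forward phase: evaluate the primitives in order, storing every
# intermediate value v_i (they are needed again in the backward phase).
x1, x2 = 0.1, 0.5

v0 = x1
v1 = x2
v2 = np.log(v0)    # ln(x1)
v3 = v0 * v1       # x1 * x2
v4 = np.sin(v1)    # sin(x2)
v5 = v3 - v4
v6 = v2 + v5
f = v6

print(f)           # -2.7320...
```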
This process constitutes the first phase of reverse-mode AD. Using this decomposition, we now want to calculate the derivatives $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$. We start from the output $f = v_6$ and propagate the derivatives backwards to the input variables $x_1$ and $x_2$:
$$
\bar{v}_6 = \frac{\partial f}{\partial v_6} = 1 \\
\bar{v}_2 = \bar{v}_6\frac{\partial v_6}{\partial v_2} = 1,~~\bar{v}_5 = \bar{v}_6\frac{\partial v_6}{\partial v_5} = 1 \\
\bar{v}_3 = \bar{v}_5\frac{\partial v_5}{\partial v_3} = 1,~~\bar{v}_4 = \bar{v}_5\frac{\partial v_5}{\partial v_4} = -1 \\
\bar{v}_1 = \bar{v}_4\frac{\partial v_4}{\partial v_1} = -\cos(v_1) = -0.87758 \\
\bar{v}_0 = \bar{v}_3\frac{\partial v_3}{\partial v_0} = v_1 = 0.5,~~\bar{v}_1 = \bar{v}_1 + \bar{v}_3\frac{\partial v_3}{\partial v_1} = \bar{v}_1 + v_0 = -0.77758 \\
\bar{v}_0 = \bar{v}_0 + \bar{v}_2\frac{\partial v_2}{\partial v_0} = \bar{v}_0 + \frac{1}{v_0} = 10.5 \\
\bar{x}_2 = \bar{v}_1 = -0.77758 \\
\bar{x}_1 = \bar{v}_0 = 10.5
$$
which gives us $\frac{\partial f}{\partial x_1}=\bar{x}_1 = 10.5$ and $\frac{\partial f}{\partial x_2}=\bar{x}_2 = -0.77758$.
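The backward phase can be written out the same way. A minimal sketch, reusing the values $v_0$ and $v_1$ from the forward pass and accumulating the adjoints exactly as in the equations above:

```python
import numpy as np

# Values from the forward pass above.
v0, v1 = 0.1, 0.5

# Backward phase: propagate adjoints from the output back to the inputs.
v6_bar = 1.0                      # bar{v}_6 = df/dv6
v2_bar = v6_bar * 1.0             # v6 = v2 + v5
v5_bar = v6_bar * 1.0
v3_bar = v5_bar * 1.0             # v5 = v3 - v4
v4_bar = v5_bar * (-1.0)
v1_bar = v4_bar * np.cos(v1)      # v4 = sin(v1)
v0_bar = v3_bar * v1              # v3 = v0 * v1  (contribution to v0)
v1_bar = v1_bar + v3_bar * v0     # second contribution to v1
v0_bar = v0_bar + v2_bar / v0     # v2 = ln(v0)   (second contribution to v0)

x1_bar, x2_bar = v0_bar, v1_bar
print(x1_bar, x2_bar)             # 10.5 -0.77758...
```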
This process is easier to see visually. Instead of writing down the intermediate steps, we use a computational graph to describe the whole process. This is a directed graph in which the nodes represent variables, constants, or basic binary/unary operations, and the edges (arrows) reflect the flow of values from one node to the next. Our function can be represented by the following graph:
*Computational graph representing $f(x_1, x_2) = \ln(x_1) + x_1 x_2 - \sin(x_2)$*
We can now visually see the process of propagating the derivatives backwards: $\frac{\partial f}{\partial \mathrm{add}}\frac{\partial \mathrm{add}}{\partial \mathrm{ln}}\frac{\partial \mathrm{ln}}{\partial x_1} + \frac{\partial f}{\partial \mathrm{add}}\frac{\partial \mathrm{add}}{\partial \mathrm{sub}}\frac{\partial \mathrm{sub}}{\partial \mathrm{mul}}\frac{\partial \mathrm{mul}}{\partial x_1} \rightarrow \frac{\partial f}{\partial x_1} = 1 \cdot 1 \cdot \frac{1}{x_1} + 1 \cdot 1 \cdot 1 \cdot x_2 = \frac{1}{x_1} + x_2$.
Similarly, $\frac{\partial f}{\partial \mathrm{add}}\frac{\partial \mathrm{add}}{\partial \mathrm{sub}}\frac{\partial \mathrm{sub}}{\partial \mathrm{mul}}\frac{\partial \mathrm{mul}}{\partial x_2}+\frac{\partial f}{\partial \mathrm{add}}\frac{\partial \mathrm{add}}{\partial \mathrm{sub}}\frac{\partial \mathrm{sub}}{\partial \mathrm{sin}}\frac{\partial \mathrm{sin}}{\partial x_2} = 1 \cdot 1 \cdot 1 \cdot x_1 + 1 \cdot 1 \cdot (-1) \cdot \cos(x_2) = x_1 - \cos(x_2)$. That is, by traversing the computational graph in a breadth-first manner, starting from the node representing the final function, we can propagate derivatives backwards until we reach the required inputs.
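The same backward propagation can be phrased directly as a graph traversal. The sketch below is illustrative (the `Node` class and its attributes are assumptions for this example, not the implementation developed in this lecture): each node stores its parents together with the local derivative on the connecting edge, and a breadth-first walk from the output node accumulates the adjoints.

```python
from collections import deque
import numpy as np

class Node:
    """Hypothetical graph node: a value plus (parent, local derivative) edges."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (Node, d(self)/d(parent)) pairs
        self.grad = 0.0          # adjoint, accumulated during the traversal

# Build the graph of f(x1, x2) = ln(x1) + x1*x2 - sin(x2) at x1=0.1, x2=0.5.
x1, x2 = Node(0.1), Node(0.5)
ln_node  = Node(np.log(x1.value), parents=((x1, 1.0 / x1.value),))
mul_node = Node(x1.value * x2.value, parents=((x1, x2.value), (x2, x1.value)))
sin_node = Node(np.sin(x2.value), parents=((x2, np.cos(x2.value)),))
sub_node = Node(mul_node.value - sin_node.value,
                parents=((mul_node, 1.0), (sin_node, -1.0)))
add_node = Node(ln_node.value + sub_node.value,
                parents=((ln_node, 1.0), (sub_node, 1.0)))

# Breadth-first traversal from the output node, pushing adjoints to the parents.
# (For this small graph every node's adjoint is complete before it is popped;
# a general implementation would process the nodes in reverse topological order.)
add_node.grad = 1.0
queue = deque([add_node])
while queue:
    node = queue.popleft()
    for parent, local_derivative in node.parents:
        parent.grad += node.grad * local_derivative
        queue.append(parent)

print(add_node.value)       # f(0.1, 0.5) = -2.7320...
print(x1.grad, x2.grad)     # 10.5 -0.77758...
```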
This simple example also demonstrates why reverse mode may be preferable to forward mode: we obtained the derivatives with respect to all input variables in a single pass over the graph. In general, if evaluating an $n$-variate function has cost $O(F)$, its full forward-mode gradient costs $O(nF)$, which becomes problematic when $n$ is large. The reverse-mode gradient, on the other hand, scales as $O(F)$ regardless of how large $n$ is.
The first phase of reverse-mode AD is to trace all arithmetic operations and intrinsic function calls going from the function inputs to its outputs. This tracing yields a computational graph in which the input variables, the intermediate variables, and the arithmetic operations and functions are represented by nodes. We can distinguish three types of nodes: variables, constants, and arithmetic operations (the latter can also represent elementary functions). To represent each type of node, we create different classes by subclassing the numpy `ndarray` class.
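As a minimal sketch of what such a subclass can look like (the class and attribute names below are illustrative assumptions, not the final interface of this lecture's implementation), numpy's standard `__new__`/`__array_finalize__` pattern lets us attach the book-keeping attributes to every array created during the forward pass:

```python
import numpy as np

class Variable(np.ndarray):
    """Illustrative ndarray subclass carrying the book-keeping data needed
    for the backward pass: parent nodes, the local derivatives on the
    incoming edges, and the accumulated adjoint."""

    def __new__(cls, input_array, parents=()):
        # View-cast the input data as a Variable and attach the graph metadata.
        obj = np.asarray(input_array, dtype=float).view(cls)
        obj.parents = tuple(parents)   # (parent, local_derivative) pairs
        obj.grad = 0.0                 # adjoint bar{v}, filled in phase two
        return obj

    def __array_finalize__(self, obj):
        # Called for every new instance (views, slices, ufunc outputs);
        # make sure the book-keeping attributes always exist.
        if obj is None:
            return
        self.parents = getattr(obj, "parents", ())
        self.grad = getattr(obj, "grad", 0.0)
```

Arithmetic operations and elementary functions would then be overridden to return new `Variable` instances whose `parents` record the incoming edges and their local derivatives, which is exactly the tracing step described above.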