The aim of this lecture is to introduce some active learning algorithms for building potential-energy surfaces (PESs) of molecules with reduced computational costs. At the end of the lecture you will:

Many results of this lecture are based on this recently-published article.

Codes for this tutorial are available at: ‣.

Active learning (AL): optimizing the choice of datasets

We have seen in the previous lecture that building the potential-energy surface (PES) of a molecule involves two steps:

1. Generating a training dataset $\mathcal{D} = \{(X_i, E_i)\}$ of molecular geometries $X_i$ and their electronic energies $E_i$;
2. Fitting the parameters $\theta$ of a model $g$ to this dataset, i.e., solving

$$ \underset{\theta}{\text{min }} \sum_{ (X_i, E_i) \in \mathcal{D}} \mathcal{L} (g(X_i;\theta), E_i) $$

where $\mathcal{L}$, in the case of building PESs, is often the root-mean-square error.
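The fitting step above can be sketched in a few lines. This is a toy illustration only: the "geometries" are bond lengths, the "energies" are generated from an invented quadratic with noise, and $g(x;\theta)$ is a simple polynomial surrogate; none of this is the model used in the article.

```python
# Minimal sketch of the fitting step: find theta minimising the RMSE loss
# over a toy dataset D = {(X_i, E_i)}. All data here are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 3.0, size=50)                    # toy "geometries" (bond lengths)
E = (X - 1.5) ** 2 + 0.01 * rng.normal(size=X.size)   # toy "energies" with small noise

def g(x, theta):
    """Polynomial surrogate model for the PES."""
    return np.polyval(theta, x)

def rmse(theta):
    """The loss L: root-mean-square error over the dataset."""
    return np.sqrt(np.mean((g(X, theta) - E) ** 2))

result = minimize(rmse, x0=np.zeros(3))               # fit a quadratic in theta
print(f"fitted RMSE: {rmse(result.x):.4f}")
```

Because $g$ is linear in $\theta$, the RMSE here is convex and a general-purpose optimiser converges easily; for real PES models (e.g. neural networks) the same minimisation is non-convex and gradient-based training is used instead.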

The following diagram summarises this process:

![Diagram of the two-step PES-building process](PL.001.jpeg)

However, the first step in this process is not always trivial. Selecting the molecular geometries requires physical insight into the problem. More importantly, the electronic energies can be computationally expensive to compute if one requires high accuracy. The following table shows how steeply these computations scale with the size of the molecular system:

![Scaling of electronic-structure methods with system size](table.001.jpeg)

where "accuracy" here refers to the accuracy of the resulting (ro-)vibrational calculations.

Thus, when building PESs, one wants to reduce the number of training data needed to achieve a given accuracy. This is the essential problem of active learning. AL can be understood as a double optimisation problem: over the set of all possible parameters and over the set of all possible datasets, i.e.,

$$ \underset{\theta, \mathcal{D}}{\text{min }} \sum_{ (X_i, E_i) \in \mathcal{D}} \mathcal{L} (g(X_i;\theta), E_i) $$

There are several kinds of active-learning algorithms, but we'll only be looking at pool-based active learning.
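The pool-based setting can be sketched as follows. We keep a pool of candidate geometries, label (i.e., compute the energy of) only a few, and repeatedly query the pool point where the model is most uncertain. As an illustrative uncertainty estimate we use a small query-by-committee scheme with polynomials of different degrees; the cheap analytic `oracle` stands in for an expensive ab initio calculation, and none of these choices are the specific method of the article.

```python
# Sketch of pool-based active learning on a toy 1-D PES.
# Uncertainty = disagreement (std) of a committee of polynomial fits.
import numpy as np

rng = np.random.default_rng(1)
oracle = lambda x: (x - 1.5) ** 2       # stands in for an expensive electronic-structure call
pool = np.linspace(0.5, 3.0, 200)       # unlabelled pool of candidate geometries

# Start from a few randomly labelled points.
idx = list(rng.choice(pool.size, size=4, replace=False))
for _ in range(10):
    X, E = pool[idx], oracle(pool[idx])
    # Committee: polynomial models of increasing flexibility fitted to the labels.
    committee = [np.polyfit(X, E, deg=d) for d in (1, 2, 3)]
    preds = np.stack([np.polyval(c, pool) for c in committee])
    disagreement = preds.std(axis=0)    # query where the committee disagrees most
    disagreement[idx] = -np.inf         # never re-query already-labelled points
    idx.append(int(disagreement.argmax()))

print(f"queried {len(idx)} energies from a pool of {pool.size} candidate geometries")
```

The key point is that the expensive `oracle` is called only on the queried points, so the dataset $\mathcal{D}$ is grown where it most improves the model rather than on a dense grid.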