(see here for a more comprehensive GFlowNet tutorial; the material below may nonetheless be useful as a first gist of the core ideas behind the ability of GFlowNets to learn to sample from and marginalize over distributions for which doing these tasks exactly is intractable)

1. Simple MSE Criterion to Amortize an Intractable Expectation

Consider a set of intractable expectations that we would like to approximate (for any $x$)

$$ S(x) = \sum_y p(y|x) R(x,y) $$

where $y$ is a rich variable taking an exponential number of values and we know how to sample from $p(y|x)$.

We could train a neural net with input $x$, stochastic target $R(x,y)$ and the MSE loss

$$ L = \left( \widehat{S}(x) - R(x,y) \right)^2 $$

where $y \sim p(y|x)$. Training the estimator $\widehat{S}$ with parameters $\theta$ using the stochastic gradients $\frac{\partial L}{\partial \theta}$ would make $\widehat{S}$ converge to $S$, provided $\widehat{S}$ has enough capacity and is trained long enough.
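As a minimal sketch of this idea (a toy discrete $y$ over three values, with $p(y|x)$ and $R(x,y)$ chosen arbitrarily for illustration; these numbers are not from the text), stochastic gradient descent on the MSE loss with stochastic targets $R(x,y)$ drives the estimate toward the expectation $S(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting with a single fixed x: y takes 3 values, and we can
# sample from p(y|x). Both p and R are made-up numbers for illustration.
p = np.array([0.2, 0.5, 0.3])   # p(y|x), easy to sample from
R = np.array([1.0, 4.0, 10.0])  # reward R(x,y)
true_S = float(np.dot(p, R))    # the expectation (tractable in this toy case)

# Amortized estimator: here just a scalar parameter, updated by SGD on
# the stochastic loss L = (S_hat - R(x,y))^2 with y ~ p(y|x).
S_hat = 0.0
for step in range(20000):
    y = rng.choice(3, p=p)            # sample a stochastic target
    grad = 2.0 * (S_hat - R[y])       # dL/dS_hat
    lr = 0.5 / (step + 1)             # decaying step size: exact running mean
    S_hat -= lr * grad

print(S_hat, true_S)  # S_hat converges to E_{Y|x}[R(x,Y)]
```

In a real application $\widehat{S}$ would be a neural net taking $x$ as input, so that a single forward pass amortizes the expectation for any new $x$.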

For any new $x$, we would then have an amortized estimator $\widehat{S}(x)$ which in one pass through the network would give us an approximation of the intractable sum. We can consider this an efficient alternative to doing a Monte-Carlo approximation

$$ \widehat{S}_{MC}(x) = {\rm mean}_{y \sim p(y|x)} R(x,y) $$

which would require a potentially large number of samples and computations of $R(x,y)$ for each $x$ at run-time, especially if $p(y|x)$ is a rich multimodal distribution (for which just a few samples do not give a good estimator of the expectation).
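To make the run-time cost concrete, here is a plain Monte-Carlo estimate in a toy discrete setting (the distribution and reward are hypothetical numbers for illustration); unlike the amortized estimator, this sampling cost is paid again for every new $x$:

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.2, 0.5, 0.3])   # p(y|x) for a single fixed x (toy numbers)
R = np.array([1.0, 4.0, 10.0])  # reward R(x,y)

# Monte-Carlo estimate: repeatedly sample y ~ p(y|x) and average R(x,y).
ys = rng.choice(3, size=1000, p=p)
S_mc = R[ys].mean()

# Compare the noisy sample average against the exact sum.
print(S_mc, float(np.dot(p, R)))
```

With only a handful of samples instead of 1000, the variance of `S_mc` would be large, which is the point made above for rich multimodal $p(y|x)$.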

Besides the advantage of faster run-time, a crucial potential advantage of the amortized version is that it could benefit from generalizable structure in the product $p(y|x) R(x,y)$: if observing a training set of $(x, R(x,y))$ pairs allows us to generalize to new pairs, then we may not need to train $\widehat{S}$ with an exponential number of examples before it captures that generalizable structure and provides good answers (i.e., $E_{Y|x}[R(x,Y)]$) on new $x$'s. The ability to generalize like this is actually what explains the remarkable success of machine learning (and in particular deep learning) in a vast array of modern AI applications.

When we do not have a $p(y|x)$ that we can sample from easily, we can in principle use MCMC methods, which form chains of samples of $y$ whose distributions converge to the desired $p(y|x)$, and where the next sample is generally obtained from the previous one by a small stochastic change that favours increases in $p(y|x)$. Unfortunately, when the modes of $p(y|x)$ occupy a small volume (i.e., throwing darts does not find them) and these modes are well-separated by low-probability regions, especially in high dimension, it tends to take exponential time to mix from one mode to another. However, this leaves money on the table: the attempts $(x,y,R(x,y))$ actually contain information one could use, via ML, to guess where the yet-unseen modes might be, given where some of the modes (those already observed) are, as illustrated below.
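The slow mixing between separated modes can be seen in a minimal Metropolis-Hastings sketch (an illustrative toy, not from the text): the target is an unnormalized bimodal density over a 1-D $y$, and the chain moves by small Gaussian proposals, always accepting uphill moves and sometimes accepting downhill ones.

```python
import numpy as np

rng = np.random.default_rng(2)

def unnorm_p(y):
    # Unnormalized bimodal target: two well-separated Gaussian bumps.
    return np.exp(-(y - 3.0) ** 2) + np.exp(-(y + 3.0) ** 2)

y = 3.0  # start in the right-hand mode
samples = []
for _ in range(5000):
    y_prop = y + 0.3 * rng.standard_normal()          # small stochastic change
    if rng.random() < unnorm_p(y_prop) / unnorm_p(y):  # Metropolis acceptance
        y = y_prop
    samples.append(y)

# The chain spends almost all its time near a mode (|y| > 1) and very
# rarely crosses the low-probability gap between the modes at -3 and +3.
frac_near_mode = np.mean(np.abs(np.array(samples)) > 1.0)
print(frac_near_mode)
```

With small proposal steps, crossing the gap requires a long run of improbable downhill acceptances, which is the exponential mixing time referred to above.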

[Figure: three observed modes (red) aligned on a grid, suggesting a fourth mode at the top-right grid point]

The above figure illustrates that if we have already found the three modes shown in red, we may guess the presence of a fourth mode at the top-right grid point, because the first three modes appear to align on a grid. The existence of such generalizable structure is why amortized ML samplers can potentially do much better than MCMC samplers.

2. GFN Criterion to Obtain a Sampler and Estimate Intractable Sums

Let us consider the situation where we do not have a handy $p(y|x)$ and our objective is just to approximate a set of intractable sums (for any $x$)

$$ S(x) = \sum_y R(x,y) $$

where we have the constraints that $R(x,y) \geq 0$ and $S(x)>0$. This may be useful for estimating the normalization constant of energy-based models or Bayesian posteriors (where $y$ = parameters and $x$ = data). Hence we may also be interested in the sampling policy