(tinyurl link to this page: https://tinyurl.com/gflownet-tutorial)

(see also Emmanuel Bengio's initial blog post on GFlowNets)

The Gist

A GFlowNet is a stochastic policy or generative model, trained such that it samples objects $x$ through a sequence of constructive steps, with probability proportional to $R(x)$, where $R$ is some given non-negative integrable reward function.
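To make the target distribution concrete, here is a tiny numerical sketch (a made-up example, not from any particular application): if the space of objects were small enough to enumerate, "sampling with probability proportional to $R(x)$" simply means sampling from $R(x)$ divided by its sum over all objects; the whole point of a GFlowNet is to approximate this without enumerating the space.

```python
# Hypothetical tiny example: a space of only four objects, with made-up non-negative rewards.
# A perfectly trained GFlowNet samples each x with probability R(x) / Z, where Z sums R over all x.
R = {"a": 1.0, "b": 3.0, "c": 0.5, "d": 0.0}
Z = sum(R.values())                           # partition function (normalizing constant)
target = {x: r / Z for x, r in R.items()}     # the distribution the GFlowNet should match
print(target)                                 # {'a': 0.222..., 'b': 0.666..., 'c': 0.111..., 'd': 0.0}
```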

It is a stochastic policy because it applies naturally when constructing $x$ can be done through a sequence of steps, which can be thought of as internal actions (they are meant to construct things like thoughts or plans or explanations). Such sequential construction is convenient for sampling compositional objects, which can often be defined by composing elements in some order, with generally many possible orders leading to the same object.

Figure 1. A constructive trajectory to build a lego object goes through a sequence of steps (each adding a lego block). The same object could be obtained in many different ways. The state of the GFlowNet describes the object (or partially constructed object) and the trajectories leading into it specify how it could have been built. The forward-sampling policy of the GFlowNet adds one lego block at a time, but we can also define a backward-sampling policy that removes lego blocks, and they can be combined (e.g., to form a Markov chain by which we can fix our current attempt at a lego boat). When we decide to show our construction, we may get a reward which is higher for boat-looking constructions. The GFlowNet forward-sampling policy tells us how we could generate the potentially exponentially many boat-looking constructions.
A neural net can be used to sample each of these forward-going constructive actions, one at a time. An object $x$ is done being constructed when a special "exit" action or a deterministic criterion of the state (e.g., $x$ has exactly $n$ elements) has been triggered, after which we can get a reward $R(x)$. Seen as a stochastic policy, the GFlowNet has an action space (for deciding what to do at each step) and a state space (for the partially constructed objects). Each sequence of actions $(a_0, a_1, \ldots)$ forms a trajectory $\tau$ that is a sequence of states $(s_0,s_1,s_2,\ldots)$. There may be many trajectories that lead to the same state $s_t$ (think about how one can construct a set by inserting elements in any order, or build a graph by pasting graph pieces, again in many possible orders). The last state $s_n$ of a complete trajectory $\tau$ is an object $x \in \cal X$ that the GFlowNet can sample, and we would like it to be sampled with probability proportional to $R(x)$. Note that [2] discusses how to generalize this to having intermediate rewards, not just a reward at the end of the trajectory.
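As a toy illustration of the fact that many trajectories can lead to the same state (an assumed set-building example, used only for intuition):

```python
# Hypothetical toy example: constructing the set {1, 3} by adding one element at a time.
# Two different action orders give two distinct trajectories ending in the same state,
# so the space of states forms a directed acyclic graph rather than a tree.
trajectory_a = [set(), {1}, {1, 3}]   # add 1, then add 3, then take the "exit" action
trajectory_b = [set(), {3}, {1, 3}]   # add 3, then add 1, then take the "exit" action
assert trajectory_a[-1] == trajectory_b[-1]   # same object x = s_n, reached along different paths
```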

A GFlowNet is a generative model because after (and during) training we can sample from it, but in its most basic form its training objective is about matching a reward function $R$ rather than fitting a dataset. Going back to the lego construction, we are not trying to find the most boat-looking lego structure, but a way to sample from all the boat-looking structures. Because of the sequential construction of objects $x$ (with an unbounded number of steps), it is very convenient to generate variable size structured objects, like sets, graphs, lists, programs or other recursively constructed data structures. Note that a GFlowNet can also define a backward-sampling procedure, i.e., given a constructed object, we could sample a plausible trajectory that could have led to it. The forward-sampling policy is called $P_F$ below and the backward-sampling policy is called $P_B$.
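For intuition about backward sampling, here is a minimal sketch, assuming the same toy set-building setting as above and a uniform $P_B$ chosen purely for illustration (a learned $P_B$ would in general be non-uniform):

```python
import random

def backward_trajectory(x: set) -> list:
    # Sample a plausible construction history of x by repeatedly undoing one "add element"
    # step: choose which element was (hypothetically) added last and remove it.
    states = [set(x)]
    while x:
        last = random.choice(sorted(x))   # uniform backward policy P_B; a learned one need not be uniform
        x = x - {last}
        states.append(set(x))
    return states[::-1]                   # reversed, so it reads as a forward trajectory s_0, ..., s_n = x

print(backward_trajectory({1, 3, 5}))     # e.g. [set(), {3}, {1, 3}, {1, 3, 5}]
```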

In terms of neural net architectures, the simplest GFlowNet architecture is one where we have a neural net that outputs a stochastic policy $\pi(a_t|s_t)=P_F(s_{t+1}|s_t)$, where $s_t$ represents a partially constructed object and $a_t$ is one of the possible actions from $s_t$, leading to new state $s_{t+1}$. $P_F$ stands for the "forward" transition probability along the GFlowNet constructive process. The same neural net is used at every step, but it produces a stochastic output $a_t$, from which a next state $s_{t+1}=T(s_t,a_t)$ is obtained (the form of $T$ is application-dependent, e.g., following the rules of chemistry if $s_t$ is a partially constructed molecule represented as a graph and $a_t$ puts an atom as a specific new node connected to an existing node in the graph). Since the state is a variable-size object, the neural net should have an appropriate architecture for taking such objects as input (e.g., an RNN, a graph neural net or a transformer). We can thus see the sequential application of the GFlowNet decisions as a particular form of recurrent stochastic net where the hidden recurrent state ($s_t$) is stochastic.
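As a concrete sketch of such a policy network, here is a minimal PyTorch example for an assumed toy domain where objects are subsets of $\{0,\ldots,N-1\}$, states are encoded as binary membership vectors, and the actions are "add element $i$" or "exit". The MLP, the masking scheme and the toy domain are all illustrative choices, not the architecture of any particular paper:

```python
import torch
import torch.nn as nn

N = 6  # number of candidate elements in the toy domain (an arbitrary choice)

class ForwardPolicy(nn.Module):
    """Outputs logits over actions: "add element i" for i in 0..N-1, plus a final "exit" action."""
    def __init__(self, n_items: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_items, hidden), nn.ReLU(),
            nn.Linear(hidden, n_items + 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        # Mask elements already in the set so they cannot be added twice (exit is never masked).
        mask = torch.cat([state.bool(), torch.zeros(1, dtype=torch.bool)])
        return logits.masked_fill(mask, float("-inf"))

def sample_trajectory(policy: ForwardPolicy):
    state = torch.zeros(N)              # s_0: the empty set, as a binary membership vector
    trajectory = [state.clone()]
    while True:
        probs = torch.softmax(policy(state), dim=-1)
        action = torch.multinomial(probs, 1).item()   # a_t ~ P_F(. | s_t)
        if action == N:                 # the "exit" action: the object x = s_n is complete
            return trajectory
        state[action] = 1.0             # s_{t+1} = T(s_t, a_t): add element `action`
        trajectory.append(state.clone())

policy = ForwardPolicy(N)
traj = sample_trajectory(policy)        # untrained here; training would make P(x) proportional to R(x)
```

Note that the same network is applied at every step, exactly as described above: only the input state changes along the trajectory.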

Figure 2: the most basic component of a GFlowNet is a neural network that defines its constructive policy, i.e., how to sample the next state $s_{t+1}$ given the previous state $s_t$, through the choice of an action $a_t$. The new state $s_{t+1}$ becomes the input for the next application of the GFlowNet policy, until a special "exit" action signals the end of the sequence, indicating that an object $x=s_n$ has been generated, and a reward $R(x)$ is provided to encourage or discourage the policy from generating such a solution. We would like $x$ to be generated with probability proportional to $R(x)$.
Where would we (or the brain) get the energy or reward function in practice?

A trained GFlowNet is both a sampler (to generate objects $x$ with probability proportional to $R(x)$) and an inference machine (it can be used to answer questions and predict probabilities about some variables in $x$ given other variables, marginalizing over the others). But that is with respect to a given energy function ${\cal E}(x) = -\log R(x)$ for that joint probability model. Where would a learning agent get that energy function from?
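In other words, with the notation above, a trained GFlowNet samples from the Boltzmann distribution associated with that energy:

$$
P(x) = \frac{R(x)}{Z} = \frac{e^{-{\cal E}(x)}}{Z}, \qquad Z = \sum_{x \in \cal X} R(x).
$$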

We are working on two approaches to do that (more details in this section below):