What?

Implicit (model-free) planning using a TransE-style latent transition model plus a GNN executor that runs value-iteration-style message passing over a latent graph of imagined states.

Why?

Implicit (model-free) planning is a developing area. Value Iteration Networks (VINs) formalise value iteration as a convolution of state values with transition probabilities followed by max-pooling over actions, under the assumption that the MDP is a grid-world. Generalised VINs drop the grid-world assumption.
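To make the convolution + max-pooling view concrete, here is a rough NumPy sketch of one VIN-style Bellman update on a grid-world; the function name and kernel shapes are my illustration, not the paper's exact formulation:

```python
import numpy as np

def vin_step(V, R, kernels):
    """One VIN-style Bellman update on an H x W grid-world.

    V:       (H, W) current value map.
    R:       (A, H, W) per-action reward maps.
    kernels: (A, 3, 3) learned kernels standing in for P(s' | s, a).
    Convolve values with the kernels, add rewards, then max-pool over actions.
    """
    A, H, W = R.shape
    Vp = np.pad(V, 1)                  # zero-pad so the 3x3 stencil fits at the borders
    Q = np.empty_like(R)
    for a in range(A):
        for i in range(H):
            for j in range(W):
                Q[a, i, j] = R[a, i, j] + np.sum(kernels[a] * Vp[i:i + 3, j:j + 3])
    return Q.max(axis=0)               # max over the action channel -> new value map
```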

Three main problems here:

How?

TL;DR, source: https://arxiv.org/abs/2010.13146

TransE is an important component of XLVIN. I haven't read the original TransE paper, so I describe it as it is presented in the XLVIN paper. First, the state is encoded into a latent space, $z: S\rightarrow\mathbf{R}^k$, with $h_s = z(s)$. Then, to estimate the effect of taking an action, we add a translation vector to the latent: $h_{s,a_i} = h_s+T(h_s,a_i)$. The intuition is: "what is the latent representation of the state we end up in after taking action $a_i$ in the state whose latent representation is $h_s$?" All of this is learned with a triplet loss:

$$ \mathcal{L} = d(z(s) + T(z(s),a),\, z(s')) + \max\{0,\ \xi - d(z(\tilde{s}), z(s'))\}, $$

where $\tilde{s}$ is the negative sample state.
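As a rough illustration, a TransE-style transition model and this loss could be sketched in PyTorch as follows; the module layout (an action embedding plus an MLP for $T$) and all names are my assumptions, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Latent transition model: h_{s,a} = h_s + T(h_s, a).

    Minimal sketch assuming a discrete action set and an existing
    state encoder z(s); sizes and names are illustrative only.
    """
    def __init__(self, k, n_actions):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, k)
        self.T = nn.Sequential(nn.Linear(2 * k, k), nn.ReLU(), nn.Linear(k, k))

    def forward(self, h_s, a):
        return h_s + self.T(torch.cat([h_s, self.action_emb(a)], dim=-1))

def transe_loss(model, z_s, a, z_next, z_neg, xi=1.0):
    """Triplet loss: pull the predicted next latent towards z(s'),
    push the negative sample z(s~) at least xi away from z(s')."""
    d = lambda x, y: torch.norm(x - y, dim=-1)
    pos = d(model(z_s, a), z_next)
    neg = torch.clamp(xi - d(z_neg, z_next), min=0)
    return (pos + neg).mean()
```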

This paper continues the line of work on neural execution of graph algorithms, which learns to perform classical CS algorithms by supervising the model on intermediate results in order to improve its transfer properties. Value iteration is a perfect algorithm to combine with neural execution because of its dynamic programming nature (check the algorithmic alignment paper to learn more).
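For reference, the algorithm the executor imitates is plain tabular value iteration; keeping its intermediate value vectors gives exactly the kind of step-by-step supervision targets mentioned above (a generic sketch, not the paper's code):

```python
import numpy as np

def value_iteration_trace(P, R, gamma=0.9, iters=10):
    """Tabular value iteration, keeping every intermediate V_k.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    The list of intermediate value vectors can serve as step-wise
    supervision for a neural executor.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    trace = [V.copy()]
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Bellman backup: Q[s,a] = R[s,a] + γ Σ_s' P[s,a,s'] V[s']
        V = Q.max(axis=1)         # greedy max over actions
        trace.append(V.copy())
    return trace
```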

The main idea of the paper is shown in the picture above. My hand-wavy Python pseudo-science code is below 🐍:

def xlvin(s, z, T, gnn, pi_net, vi_net, action_set, K) -> Tuple[float, float]:
  h_s = z(s)  # embed the state
  S = [set() for _ in range(K)]  # init embedding sets, K is the max depth
  S[0].add(h_s)  # init depth-0 set with the state embedding
  E = set()  # init edge set
  Nb = {h_s: []}  # init neighbourhood map
  for depth in range(1, K):
    for h in S[depth - 1]:
      for a in action_set:
        h_next = h + T(h, a)  # compute expected neighbour embedding
        S[depth].add(h_next)  # add embedding to the graph
        E.add((h, h_next, a))  # update edges
        Nb[h].append(h_next)  # define the neighbourhood of h
        Nb.setdefault(h_next, [])  # new node starts with no neighbours yet

  # run the execution model on graph (S, E) for K iterations,
  # read out the output features of the node that stored the s embedding
  h_out_emb = gnn(S, E, Nb).get_node_features(h_s)
  return pi_net(h_s, h_out_emb), vi_net(h_s, h_out_emb)

All this beauty is trained using PPO, with $T$ pretrained on samples from a random policy and the executor pretrained on a dataset of MDPs (or some random graphs, e.g. Erdős–Rényi or Barabási–Albert). Everything is validated on known MDPs (8x8 grid-worlds), continuous-observation envs (classic control) and pixel-based envs (Atari).
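To illustrate what the executor-pretraining data might look like, here is one hypothetical way to generate an MDP over an Erdős–Rényi graph; supervision targets can then be obtained with a value-iteration trace like the one sketched earlier. The exact construction in the paper may differ:

```python
import networkx as nx
import numpy as np

def random_graph_mdp(n=20, p=0.2, n_actions=4, seed=0):
    """Build a synthetic MDP over an Erdős–Rényi graph for executor pretraining.

    Each action deterministically moves to a randomly chosen neighbour;
    rewards are random. A hypothetical data generator, not the paper's.
    """
    rng = np.random.default_rng(seed)
    g = nx.erdos_renyi_graph(n, p, seed=seed)
    P = np.zeros((n, n_actions, n))
    for s in g.nodes:
        nbrs = list(g.neighbors(s)) or [s]      # self-loop if the node is isolated
        for a in range(n_actions):
            P[s, a, rng.choice(nbrs)] = 1.0     # move to a sampled neighbour
    R = rng.normal(size=(n, n_actions))
    return P, R  # feed into value_iteration_trace above to get supervision targets
```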

And?