What?

A study of the replay buffer (a nonparametric transition model) versus a parametric transition model.

Why?

Can a model-free agent with experience replay be better than a Dyna-style algorithm?

How?

The authors use planning in a very specific way; in fact, they use it in two ways. First, planning denotes any additional computation spent to improve the value predictions. Second, planning means updating the agent with data sampled from the model rather than from the environment.
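
To make the replay-buffer side of the comparison concrete, here is a minimal sketch of a replay buffer exposed through the same interface as a transition model (the ReplayBuffer class and its add/sample methods are illustrative, not code from the paper): "sampling the model" simply replays a stored transition, which is what makes the buffer a nonparametric, non-generalizing transition model. It is reused after the Dyna boilerplate below.

import random

class ReplayBuffer:
    # A replay buffer viewed as a nonparametric transition model:
    # sampling it replays a stored (s, a, r, snext) tuple verbatim.
    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []

    def add(self, s, a, r, snext):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)  # evict the oldest transition
        self.storage.append((s, a, r, snext))

    def sample(self):
        # an exact past transition: no model error, but no generalization
        return random.choice(self.storage)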

A general boilerplate for a Dyna-style algorithm:

def mbrl(state_distr, model, policy, value, env, K, M, P):
    s = env.reset()  # reset/done handling omitted for simplicity
    for it in range(K):
        # usual RL stage + learning the model from real transitions
        for step in range(M):
            a = policy(s)
            r, snext = env.step(a)
            model, state_distr = update_model(s, a, r, snext, model, state_distr)
            policy, value = update_agent(s, a, r, snext, policy, value)
            s = snext
        # planning stage: the agent is updated with data
        # sampled from the learned model only!
        for pstep in range(P):
            s, a = state_distr.sample()
            r, snext = model(s, a)
            policy, value = update_agent(s, a, r, snext, policy, value)
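
For contrast, a model-free agent with experience replay fits the same template. The sketch below is my own, under the same assumptions as the boilerplate above (update_agent, K, M, P are the same placeholders), with the ReplayBuffer from earlier standing in for both state_distr and model:

def replay_rl(buffer, policy, value, env, K, M, P):
    s = env.reset()  # reset/done handling omitted, as above
    for it in range(K):
        # acting stage: store real transitions instead of fitting a model
        for step in range(M):
            a = policy(s)
            r, snext = env.step(a)
            buffer.add(s, a, r, snext)
            policy, value = update_agent(s, a, r, snext, policy, value)
            s = snext
        # replay stage: "planning" with stored real transitions only
        for pstep in range(P):
            s, a, r, snext = buffer.sample()
            policy, value = update_agent(s, a, r, snext, policy, value)

The only difference from the Dyna loop is where the planning-stage data comes from: replayed real transitions versus samples from a learned (and possibly wrong) model.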

And?