What?

Go-Explore, a family of algorithms that solves the detachment and derailment problems (described below) to improve performance when rewards are sparse or deceptive.

Why?

RL is great, but exploring and learning at the same time is hard.

How?

Source: original paper ❤️❤️❤️

TL;DR 1. Go. 2. Explore. 3. ??? 4. Profit!

The paper identifies two exploration problems with RL agents: detachment (the agent loses track of promising areas it has visited but not fully explored) and derailment (the exploratory mechanism keeps the agent from first returning to a previously visited state before exploring onward from it):

First room of Montezuma's Revenge used in my example below.

There was a really nice picture in the original preprint, which the authors for some reason omitted from the Nature version (you can still find it in the supplementary material). I copy it below:

Source: Original Preprint. https://arxiv.org/abs/1901.10995

So, to solve detachment and derailment, Go-Explore splits training into two main phases: exploration and robustification. In the exploration phase, the agent selects a promising cell (a downsampled state) from an archive of cells encountered so far, returns to it, and randomly explores from there; the archive is updated with anything new, and the process repeats. Returning to a cell is done either by saving and restoring the environment state, or by exploiting environment determinism and replaying the actions. Once a good solution is found, the robustification phase trains a policy with learning from demonstrations so it can reliably repeat the best trajectories stored in the archive.
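
To make the "cell" idea concrete, here is a minimal sketch of downsampling a raw frame into a hashable cell. For Atari the paper uses a tiny grayscale image with few intensity levels; the exact resolution and depth below are illustrative hyperparameters, not the paper's tuned values:

```python
import numpy as np

def state2cell(frame, h=8, w=11, depth=8):
    """Downsample an observation into a small, hashable cell."""
    frame = np.asarray(frame, dtype=np.float32)
    if frame.ndim == 3:                       # RGB -> grayscale
        frame = frame.mean(axis=2)
    H, W = frame.shape
    # average over coarse blocks to get an h x w thumbnail
    small = frame[: H - H % h, : W - W % w]
    small = small.reshape(h, H // h, w, W // w).mean(axis=(1, 3))
    # quantize to `depth` gray levels
    cell = np.floor(small / 256.0 * depth).clip(0, depth - 1).astype(np.uint8)
    return cell.tobytes()                     # hashable -> usable as a dict key

# two nearly identical frames land in the same cell
a = np.zeros((210, 160, 3), dtype=np.uint8)
b = a.copy(); b[0, 0] = 255                   # flip a single pixel
assert state2cell(a) == state2cell(b)         # still the same cell
```

The point of the aggressive downsampling is exactly this conflation: states that differ in irrelevant details map to one cell, so the archive stays small.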

My traditional pseudo-science-code below:

def state2cell(state):
  # downsample the raw observation into a small, hashable cell
  return downsample(state)

def init_cell(cell):
  return (0, 1, ...) # score, visits, ...

def upd_cell(archive, cell, cumulative_return=0):
  if cell not in archive:
      # new cell: store its stats plus a restorable snapshot of the env
      archive[cell] = (*init_cell(cell), env.entire_state)
  else:
      # known cell: update counters and keep the best cumulative return here
      archive[cell] = ...

def update_archive(rollouts, archive):
  for state, cumulative_return in rollouts:
    cell = state2cell(state)
    upd_cell(archive, cell, cumulative_return)

def solve_agi(env):
  # EXPLORATION PHASE
  # archive: a dict of cell -> (score, visits, ..., saved env state)
  archive = {}
  state = env.reset()
  upd_cell(archive, state2cell(state)) # seed the archive with the start cell

  agi_solved = False
  while not agi_solved:
    cell = select_promising_cell(archive)
    restore(env, cell)             # go: restore the saved state (or replay actions)
    rollouts = get_rollouts(env)   # explore: random actions from there
    update_archive(rollouts, archive)
    agi_solved = is_agi_solved(archive)

  # ROBUSTIFICATION PHASE
  demonstrations = archive2demonstrations(archive)
  agi = learning_from_demonstrations(demonstrations)
  return agi
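
Since the snippet above is only pseudocode, here is a self-contained toy version of the exploration phase that actually runs: a deterministic 5x5 grid world with a single sparse reward at the far corner. The environment, the visits-based selection weight, and all the names are my own simplifications for illustration, not the paper's exact setup:

```python
import random

random.seed(0)

GOAL = (4, 4)  # reward only at the far corner: sparse reward

def step(pos, action):
    """Deterministic 5x5 grid world; actions move up/down/left/right."""
    dx, dy = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}[action]
    return (min(max(pos[0] + dx, 0), 4), min(max(pos[1] + dy, 0), 4))

def explore():
    # archive: cell -> {'state': restorable snapshot, 'visits': count}
    # here a cell is just the position itself (the trivial downsampling)
    start = (0, 0)
    archive = {start: {'state': start, 'visits': 0}}
    for _ in range(500):
        # GO: prefer rarely visited cells (a simplified selection weight)
        cell = min(archive, key=lambda c: archive[c]['visits'])
        archive[cell]['visits'] += 1
        pos = archive[cell]['state']           # restore the saved state
        # EXPLORE: a short burst of random actions from there
        for _ in range(5):
            pos = step(pos, random.randrange(4))
            if pos not in archive:
                archive[pos] = {'state': pos, 'visits': 0}
            if pos == GOAL:
                return archive                 # sparse goal cell reached
    return archive

archive = explore()
print(GOAL in archive)
```

Note how the "go" step costs nothing here because restoring a saved state teleports the agent; a plain random-walk explorer from the start state would waste most of its steps rediscovering cells it had already seen, which is exactly the detachment problem the archive avoids.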