What?

Go-Explore, a family of algorithms that solves the detachment and derailment problems (described below) to improve performance when rewards are sparse or deceptive.

Why?

RL is great, but exploring and learning at the same time is hard.

How?

Source: original paper ❤️❤️❤️

TL;DR 1. Go. 2. Explore. 3. ??? 4. Profit!

The paper identifies two exploration problems with RL agents: detachment (the agent loses track of promising areas it has visited but not fully explored) and derailment (the exploratory mechanism keeps the agent from first returning to a previously visited state before exploring onward from it):

First room of Montezuma's Revenge used in my example below.

There was a really nice picture in the original preprint, which the authors for some reason omitted from the Nature version (you can still find it in the supplementary material). I copy it below:

Source: Original Preprint. https://arxiv.org/abs/1901.10995

So, to solve detachment and derailment, Go-Explore splits training into two main phases: exploration and robustification. In the exploration phase, the agent selects a promising cell (a downsampled state) from an archive of cells encountered so far, returns to it, and randomly explores from there; the archive is updated with anything new, and the process repeats. Returning to a cell is done either by saving and restoring the environment state, or by exploiting environment determinism and replaying the actions. Once a good solution is found, the robustification phase trains a policy with learning from demonstrations so it can reliably repeat the best trajectories stored in the archive.
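
To make the "cell" idea concrete, here is a minimal sketch of downsampling a raw frame into a hashable cell. For Atari the paper uses a tiny grayscale image with few intensity levels; the exact resolution and depth below are illustrative hyperparameters, not the paper's tuned values:

```python
import numpy as np

def state2cell(frame, h=8, w=11, depth=8):
    """Downsample an observation into a small, hashable cell."""
    frame = np.asarray(frame, dtype=np.float32)
    if frame.ndim == 3:                       # RGB -> grayscale
        frame = frame.mean(axis=2)
    H, W = frame.shape
    # average over coarse blocks to get an h x w thumbnail
    small = frame[: H - H % h, : W - W % w]
    small = small.reshape(h, H // h, w, W // w).mean(axis=(1, 3))
    # quantize to `depth` gray levels
    cell = np.floor(small / 256.0 * depth).clip(0, depth - 1).astype(np.uint8)
    return cell.tobytes()                     # hashable -> usable as a dict key

# two nearly identical frames land in the same cell
a = np.zeros((210, 160, 3), dtype=np.uint8)
b = a.copy(); b[0, 0] = 255                   # flip a single pixel
assert state2cell(a) == state2cell(b)         # still the same cell
```

The point of the aggressive downsampling is exactly this conflation: states that differ in irrelevant details map to one cell, so the archive stays small.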

My traditional pseudo-science-code below:

def state2cell(state):
  # downsample the raw observation into a small, hashable cell
  return downsample(state)

def init_cell(cell):
  return (0, 1, ...) # score, visits, ...

def upd_cell(archive, cell, cumulative_return=0):
  if cell not in archive:
      # new cell: store its stats plus a restorable snapshot of the env
      archive[cell] = (*init_cell(cell), env.entire_state)
  else:
      # known cell: update counters and keep the best cumulative return here
      archive[cell] = ...

def update_archive(rollouts, archive):
  for state, cumulative_return in rollouts:
    cell = state2cell(state)
    upd_cell(archive, cell, cumulative_return)

def solve_agi(env):
  # EXPLORATION PHASE
  # archive: a dict of cell -> (score, visits, ..., saved env state)
  archive = {}
  state = env.reset()
  upd_cell(archive, state2cell(state)) # seed the archive with the start cell

  agi_solved = False
  while not agi_solved:
    cell = select_promising_cell(archive)
    restore(env, cell)             # go: restore the saved state (or replay actions)
    rollouts = get_rollouts(env)   # explore: random actions from there
    update_archive(rollouts, archive)
    agi_solved = is_agi_solved(archive)

  # ROBUSTIFICATION PHASE
  demonstrations = archive2demonstrations(archive)
  agi = learning_from_demonstrations(demonstrations)
  return agi
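
Since the snippet above is only pseudocode, here is a self-contained toy version of the exploration phase that actually runs: a deterministic 5x5 grid world with a single sparse reward at the far corner. The environment, the visits-based selection weight, and all the names are my own simplifications for illustration, not the paper's exact setup:

```python
import random

random.seed(0)

GOAL = (4, 4)  # reward only at the far corner: sparse reward

def step(pos, action):
    """Deterministic 5x5 grid world; actions move up/down/left/right."""
    dx, dy = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}[action]
    return (min(max(pos[0] + dx, 0), 4), min(max(pos[1] + dy, 0), 4))

def explore():
    # archive: cell -> {'state': restorable snapshot, 'visits': count}
    # here a cell is just the position itself (the trivial downsampling)
    start = (0, 0)
    archive = {start: {'state': start, 'visits': 0}}
    for _ in range(500):
        # GO: prefer rarely visited cells (a simplified selection weight)
        cell = min(archive, key=lambda c: archive[c]['visits'])
        archive[cell]['visits'] += 1
        pos = archive[cell]['state']           # restore the saved state
        # EXPLORE: a short burst of random actions from there
        for _ in range(5):
            pos = step(pos, random.randrange(4))
            if pos not in archive:
                archive[pos] = {'state': pos, 'visits': 0}
            if pos == GOAL:
                return archive                 # sparse goal cell reached
    return archive

archive = explore()
print(GOAL in archive)
```

Note how the "go" step costs nothing here because restoring a saved state teleports the agent; a plain random-walk explorer from the start state would waste most of its steps rediscovering cells it had already seen, which is exactly the detachment problem the archive avoids.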