Go-Explore is a family of algorithms that solves the detachment and derailment problems (described below) to improve performance when rewards are sparse or deceptive.
RL is great, but exploration+learning is so hard.

Source: original paper ❤️❤️❤️
TL;DR 1. Go. 2. Explore. 3. ??? 4. Profit!
The paper identifies two types of problems with RL agents:

- **Detachment**: the agent loses track of promising areas it has visited before, because the intrinsic reward there has already been consumed, so it never returns to continue exploring from them.
- **Derailment**: stochastic exploratory actions prevent the agent from reliably returning to a previously discovered promising state before exploring further from it.

First room of Montezuma's Revenge used in my example below.
There was a really nice picture in the original preprint, which the authors omitted for some reason in the Nature version (actually, you can find it in the supplementary material). I copy it below:

Source: Original Preprint. https://arxiv.org/abs/1901.10995
So, to solve detachment and derailment, Go-Explore splits the training process into two main phases: exploration and robustification. First, an agent selects a promising cell (a downsampled state) from an archive of encountered cells and randomly explores from there. The archive is updated and the process is repeated. Returning to a cell can be done either by saving environment states and restoring them, or by exploiting environment determinism. After a good solution is found, a policy is trained with learning from demonstrations so it can reliably repeat the best solution stored in the archive.
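For a concrete picture of what a "cell" can be: the paper maps each Atari frame to a tiny, coarse grayscale image, so that many similar states collapse into one archive key. A minimal sketch with NumPy (the grid size and number of intensity levels here are my assumptions, not the paper's tuned values):

```python
import numpy as np

def frame_to_cell(frame, cell_h=8, cell_w=11, depth=8):
    """Downsample an RGB frame to a coarse grayscale grid and return
    a hashable tuple, usable as a key in the cell archive."""
    gray = frame.mean(axis=2)                      # RGB -> grayscale
    h, w = gray.shape
    # crop so the frame divides evenly, then average-pool each block
    gray = gray[: h - h % cell_h, : w - w % cell_w]
    pooled = gray.reshape(cell_h, gray.shape[0] // cell_h,
                          cell_w, gray.shape[1] // cell_w).mean(axis=(1, 3))
    # quantize to `depth` intensity levels so many nearby states
    # map to the same cell
    quantized = np.floor(pooled / 256.0 * depth).astype(int)
    return tuple(quantized.flatten())

frame = np.zeros((210, 160, 3), dtype=np.uint8)    # Atari-sized frame
cell = frame_to_cell(frame)                        # 8 * 11 = 88 quantized pixels
```

Coarser cells mean a smaller archive but less precise "go back" targets; the paper tunes this tradeoff.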
My traditional pseudo-science code below:
def state2cell(state):
    return downsample(state)

def init_cell():
    return {"score": 0, "visits": 0}

def upd_cell(archive, cell, cumulative_return=0):
    if cell not in archive:
        # store stats together with the full simulator state so we can
        # later "go" back to this cell by restoring it
        archive[cell] = {**init_cell(), "state": env.entire_state}
    entry = archive[cell]
    entry["visits"] += 1
    # keep the best cumulative return seen for this cell
    entry["score"] = max(entry["score"], cumulative_return)

def update_archive(rollouts, archive):
    for s, cumulative_return in rollouts:
        cell = state2cell(s)
        upd_cell(archive, cell, cumulative_return)

def solve_agi(env):
    # EXPLORATION PHASE
    # a dict of cell -> {"score": ..., "visits": ..., "state": ...}
    archive = {}
    state = env.reset()
    upd_cell(archive, state2cell(state))
    agi_solved = False
    while not agi_solved:
        cell = select_promising_cell(archive)  # "Go": restore the saved state
        rollouts = get_rollouts(cell)          # "Explore": random actions from there
        update_archive(rollouts, archive)
        agi_solved = is_agi_solved(archive)
    # ROBUSTIFICATION PHASE
    demonstrations = archive2demonstrations(archive)
    agi = learning_from_demonstrations(demonstrations)
    return agi
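The `select_promising_cell` step above can be sketched as weighted sampling that favors rarely visited cells (the paper's actual cell-selection heuristic is more elaborate; the inverse-square-root weighting below is a simplifying assumption):

```python
import random

def select_promising_cell(archive):
    # weight each cell by ~1/sqrt(visits): rarely visited cells get
    # sampled more often, so exploration keeps pushing the frontier
    cells = list(archive)
    weights = [1.0 / (archive[c]["visits"] ** 0.5 + 1.0) for c in cells]
    return random.choices(cells, weights=weights, k=1)[0]

archive = {"start": {"visits": 500}, "frontier": {"visits": 3}}
# "frontier" gets picked far more often than "start"
```

Any weighting that keeps a nonzero probability on every cell preserves the key property: no discovered area is ever permanently abandoned, which is exactly what prevents detachment.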