What?
Augmenting the model in offline MBRL for zero-shot generalisation.
Why?
Everyone can do offline MBRL now, but can you do ZERO-SHOT transfer from offline MBRL?
How?
In short, the idea is to augment the model's transitions during training and to train the policy on the augmentation as well.
The authors consider three different augmentations, but eventually settle on one (DAS):
$$
\mathcal{T}_z: (s, a, r, s') \rightarrow (s,a,r,s+ z \odot (s'-s)).
$$
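To make $\mathcal{T}_z$ concrete, here is a minimal sketch of applying the augmentation to a batch of transitions. The shapes and the log-uniform sampling range for $z$ are my assumptions for illustration, not taken from the paper.

```python
import numpy as np

def das_augment(s, a, r, s_next, z):
    """Apply T_z: scale the state change (s' - s) elementwise by z,
    leaving state, action, and reward untouched."""
    return s, a, r, s + z * (s_next - s)

rng = np.random.default_rng(0)
state_dim = 11
# One augmentation vector per batch; the sampling range is assumed.
z = np.exp(rng.uniform(np.log(0.5), np.log(2.0), size=state_dim))
s = rng.normal(size=(32, state_dim))
s_next = s + 0.1 * rng.normal(size=(32, state_dim))
_, _, _, s_next_aug = das_augment(s, None, None, s_next, z)
```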
The authors provide a nice algorithm listing, which I convert to my typical pseudosciencecode below.
```python
def train(D, penalty, horizon, bsize, augmentation, epochs):
    # Fit an ensemble of dynamics models on the offline dataset
    models_ensemble = init_models()
    models_ensemble.fit(D)
    policy = init_policy()
    buffer = []
    for epoch in range(epochs):
        # Sample starting states for model rollouts from the offline data
        upd_data = D.sample(bsize)
        # Roll the policy out in the ensemble, penalising model uncertainty
        rollout(policy, models_ensemble, upd_data, buffer, penalty, horizon)
        # Train the policy on real and model-generated data, applying the augmentation
        train_policy(policy, D.union(buffer), augmentation)
    return policy
```
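Modulo the augmentation, this looks like the standard MOPO-style loop: short uncertainty-penalised rollouts inside the learned ensemble, with the policy trained on a mix of real and model-generated transitions.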
The policy now takes the augmentation vector concatenated with the state as its input.
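In code, the conditioning is just a concatenation; a tiny sketch (dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=(32, 11))                          # batch of states
z = np.ones(11)                                        # augmentation/context vector
z_batch = np.broadcast_to(z, s.shape)                  # same z for the whole batch
policy_input = np.concatenate([s, z_batch], axis=-1)   # shape (32, 22)
```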
At test time we do not have the ground-truth augmentation vector, so the authors run a regression on the current rollout data and feed the predicted context to the policy. In particular, they learn a forward model that predicts the change in state, compare its prediction $\hat{\delta}_t$ with the observed change $\delta_t = s_{t+1} - s_t$, and obtain the augmentation vector $\hat{z}_t = \delta_t / \hat{\delta}_t$.
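A sketch of this test-time inference as I read it; the forward model interface and the aggregation of $\hat{z}_t$ over the rollout (a plain mean here) are my assumptions.

```python
import numpy as np

def infer_context(forward_model, states, actions, next_states, eps=1e-8):
    """Estimate z from rollout data collected in the target environment.

    forward_model(s, a) is assumed to predict the state change under the
    training dynamics; z_hat is the elementwise ratio of observed to
    predicted change, averaged over the rollout so far (assumption).
    """
    delta = next_states - states                 # observed s_{t+1} - s_t
    delta_hat = forward_model(states, actions)   # predicted change
    z_hat = delta / (delta_hat + eps)            # elementwise ratio per step
    return z_hat.mean(axis=0)                    # aggregate over time

# Toy usage with a stand-in forward model
rng = np.random.default_rng(1)
toy_model = lambda s, a: 0.1 * np.tanh(s)
S = rng.normal(size=(50, 11))
A = rng.normal(size=(50, 3))
S_next = S + 2.0 * toy_model(S, A)               # true dynamics scaled by z = 2
print(infer_context(toy_model, S, A, S_next))    # ~2 in every dimension
```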
And?
- I like the paper, but there are important questions to which I'd like answers:
  - Why offline MBRL? Probably the answer is in the MOPO paper, which I haven't read, but it would be nice to have a reminder here.
  - I don't get how Equation 1 is used in the paper ($\inf_z D(\hat{P}_z(s,a) \,\|\, P^*(s,a)) \leq \epsilon$).
  - I know it's not popular to talk about target tasks in zero-shot generalisation papers, but this is important: what are the assumptions on the tasks we are transferring to?
  - In Figure 6, we can see that the RAD augmentation decreases the performance of the agent. Why does this happen?
  - In Table 2, we can see that not learning a context works better for zero-shot generalisation. Why is that? I believe this question is related to the assumptions on the tasks we consider.
- I don't think Figure 5 (left) is very informative: absolute numbers don't mean much on their own. The scores should be normalised by the performance of an MBRL agent trained on the respective dataset, or at least by an online SAC agent, to establish a baseline. Some setups (e.g. 0.25 mass, 0.25 damping) are probably much harder to learn in the first place, so a decrease in score is not very indicative here.
- I really like the way the authors build a narrative by asking questions and designing the experiments to answer them.
- Kudos to the authors for testing statistical significance; however, it would be cool to have this for Table 2 as well.
- I believe this work is highly relevant here: it's not about offline RL or MBRL, but it is relevant for the context selection.