What?
Augmenting the model in offline MBRL for zero-shot generalisation.
Why?
Everyone can do offline MBRL now, but can you do ZERO-SHOT transfer from offline MBRL?
How?
In short, the idea is to augment the model's transitions during training and to train the policy on the augmentation as well.
The authors consider three different augmentations, but eventually settle on one (DAS):
$$
\mathcal{T}_z: (s, a, r, s') \rightarrow (s,a,r,s+ z \odot (s'-s)).
$$
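To make $\mathcal{T}_z$ concrete, here is a minimal sketch of applying the augmentation to a batch of transitions. The shapes and the log-uniform sampling range for $z$ are my assumptions for illustration, not taken from the paper.

```python
import numpy as np

def das_augment(s, a, r, s_next, z):
    """Apply T_z: scale the state change (s' - s) elementwise by z,
    leaving state, action, and reward untouched."""
    return s, a, r, s + z * (s_next - s)

rng = np.random.default_rng(0)
state_dim = 11
# One augmentation vector per batch; the sampling range is assumed.
z = np.exp(rng.uniform(np.log(0.5), np.log(2.0), size=state_dim))
s = rng.normal(size=(32, state_dim))
s_next = s + 0.1 * rng.normal(size=(32, state_dim))
_, _, _, s_next_aug = das_augment(s, None, None, s_next, z)
```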
The authors provide a nice algorithm listing, which I convert to my typical pseudosciencecode below.
```python
def train(D, penalty, horizon, bsize, augmentation, epochs):
    # Fit an ensemble of dynamics models on the offline dataset
    models_ensemble = init_models()
    models_ensemble.fit(D)
    policy = init_policy()
    buffer = []
    for epoch in range(epochs):
        # Sample starting states for model rollouts from the offline data
        upd_data = D.sample(bsize)
        # Roll the policy out in the ensemble, penalising model uncertainty
        rollout(policy, models_ensemble, upd_data, buffer, penalty, horizon)
        # Train the policy on real and model-generated data, applying the augmentation
        train_policy(policy, D.union(buffer), augmentation)
    return policy
```
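Modulo the augmentation, this looks like the standard MOPO-style loop: short uncertainty-penalised rollouts inside the learned ensemble, with the policy trained on a mix of real and model-generated transitions.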
The policy now takes the augmentation vector concatenated with the state as its input.
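In code, the conditioning is just a concatenation; a tiny sketch (dimensions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=(32, 11))                          # batch of states
z = np.ones(11)                                        # augmentation/context vector
z_batch = np.broadcast_to(z, s.shape)                  # same z for the whole batch
policy_input = np.concatenate([s, z_batch], axis=-1)   # shape (32, 22)
```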
At test time we do not have the ground-truth augmentation vector, so the authors run a regression on the current rollout data and feed the predicted context to the policy. In particular, they learn a forward model that predicts the change in state, compare its prediction $\hat{\delta}_t$ with the observed change $\delta_t = s_{t+1} - s_t$, and obtain the augmentation vector $\hat{z}_t = \delta_t / \hat{\delta}_t$.
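A sketch of this test-time inference as I read it; the forward model interface and the aggregation of $\hat{z}_t$ over the rollout (a plain mean here) are my assumptions.

```python
import numpy as np

def infer_context(forward_model, states, actions, next_states, eps=1e-8):
    """Estimate z from rollout data collected in the target environment.

    forward_model(s, a) is assumed to predict the state change under the
    training dynamics; z_hat is the elementwise ratio of observed to
    predicted change, averaged over the rollout so far (assumption).
    """
    delta = next_states - states                 # observed s_{t+1} - s_t
    delta_hat = forward_model(states, actions)   # predicted change
    z_hat = delta / (delta_hat + eps)            # elementwise ratio per step
    return z_hat.mean(axis=0)                    # aggregate over time

# Toy usage with a stand-in forward model
rng = np.random.default_rng(1)
toy_model = lambda s, a: 0.1 * np.tanh(s)
S = rng.normal(size=(50, 11))
A = rng.normal(size=(50, 3))
S_next = S + 2.0 * toy_model(S, A)               # true dynamics scaled by z = 2
print(infer_context(toy_model, S, A, S_next))    # ~2 in every dimension
```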
And?
- I like the paper, but there are important questions to which I'd like answers:
  - Why offline MBRL? Probably the answer is in the MOPO paper, which I haven't read, but it would be nice to have a reminder here.
  - I don't get how Equation 1 is used in the paper ($\inf_z D(\hat{P}_z(s,a) \,\|\, P^*(s,a)) \leq \epsilon$).
  - I know it's not popular to talk about target tasks in zero-shot generalisation papers, but this is important: what are the assumptions on the tasks we are transferring to?
  - In Figure 6, we can see that the RAD augmentation decreases the performance of the agent. Why does this happen?
  - In Table 2, we can see that not learning a context works better for zero-shot generalisation. Why is that? I believe this question is related to the assumptions on the tasks we consider.
- I don't think Figure 5 (left) is very informative: absolute numbers don't mean much on their own. The scores should be normalised by the performance of an MBRL agent trained on the respective dataset, or at least by an online SAC agent, to establish a baseline. Some setups (e.g. 0.25 mass, 0.25 damping) are probably much harder to learn in the first place, so a decrease in score is not very indicative here.
- I really like the way the authors build a narrative by asking questions and designing the experiments to answer them.
- Kudos to the authors for testing statistical significance; however, it would be cool to have this for Table 2 as well.
- I believe this work is highly relevant here: it's not about offline RL or MBRL, but it is relevant for the context selection.