March 2026
We want LLMs to be good continual learners, meaning that they learn to learn from trial and error. This enables emergent agentic capabilities, including error recovery, dynamic tool learning, and better just-in-time retrieval, all via test-time exploration.
Why do we need such behaviors? The fundamental reason is that we do not have complete coverage of the prompt space. If we knew all possible user queries in advance, we could simply apply RL to memorize the optimal trajectories, without the need for meta-learning! This epistemic uncertainty makes the Markov decision process (MDP) partially observable (POMDP). As a result, the optimal policy becomes context-adaptive, $\pi(\cdot\mid h)$, as opposed to the Markovian policy $\pi(\cdot\mid s)$ in RL. Please refer to the Method 2 section for a formal discussion.
$$ s_0 \rightarrow a_0 \rightarrow f_0 \rightarrow a_1 \rightarrow f_1 \rightarrow a_2 \rightarrow r_0 \tag{1} $$
This formula describes current LLM workflows. Here, $s_0$ is the prompt, $a_t$ is the LLM response that can involve tool calling, $f_t$ is the execution feedback from those tools, and $r_0$ is the outcome reward ($0$ or $1$).
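The workflow in Equation 1 can be sketched as a simple agent loop. This is a minimal illustration, not a real implementation: `llm`, `run_tool`, and `verifier` are hypothetical stand-ins for a chat model, a tool executor, and an outcome checker.

```python
def run_episode(llm, run_tool, verifier, prompt, max_steps=3):
    """One episode of Equation 1: s_0 -> a_0 -> f_0 -> ... -> r_0."""
    context = [prompt]                 # s_0
    for _ in range(max_steps):
        action = llm(context)          # a_t, may contain a tool call
        context.append(action)
        if action.get("tool_call") is None:
            break                      # final answer, no further feedback
        feedback = run_tool(action["tool_call"])  # f_t
        context.append(feedback)
    reward = verifier(context)         # outcome reward r_0 in {0, 1}
    return context, reward
```

Note that the reward is produced once, at the very end: nothing in this loop lets the model react to $r_0$, which is exactly what the multi-episode formulation below changes.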
To train good continual learners, we instead want the following workflow:
$$ \underbrace{s_0 \rightarrow a_0 \rightarrow f_0 \rightarrow a_1 \rightarrow f_1 \rightarrow a_2 \rightarrow r_0}_{\text{episode 0}} \textcolor{blue}{\rightarrow} \underbrace{\textcolor{blue}{a_3 \rightarrow f_3 \rightarrow a_4 \rightarrow r_1}}_{\text{episode 1}}\textcolor{blue}{\rightarrow}\underbrace{\textcolor{blue}{\cdots \rightarrow r_{k-1}}}_{\text{episode }k-1} \qquad\tag{2} $$
Here we extend Equation 1 by defining a trial that contains multiple episodes. This setup encourages the model to perform context-gathering behaviors through exploration. The procedure is a standard (in-context) meta-RL formulation, with objective
$$ \max_{\pi(\cdot\mid h)} \mathbb{E}_{a\sim\pi(\cdot\mid h)}\Biggl[\sum_{j=0}^{k-1} \gamma^{j}r_j\Biggr]\qquad\tag{3} $$
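A trial that optimizes the objective in Equation 3 can be sketched as follows. This is a hedged illustration under simplifying assumptions: `llm` and `verifier` are hypothetical stand-ins, and a real setup would also interleave tool feedback $f_t$ inside each episode as in Equation 2.

```python
def run_trial(llm, verifier, prompt, k=3, gamma=0.9):
    """Run up to k episodes; feed each reward r_j back into the context."""
    history = [prompt]            # h grows across the whole trial
    rewards = []
    for _ in range(k):
        answer = llm(history)     # the episode's final action
        history.append(answer)
        r = verifier(answer)      # r_j, appended so later episodes see it
        history.append({"reward": r})
        rewards.append(r)
        if r == 1:                # stop early once solved
            break
    # Discounted trial return from Equation 3: sum_j gamma^j * r_j.
    ret = sum(gamma**j * r for j, r in enumerate(rewards))
    return history, rewards, ret
```

The key difference from the single-episode loop is that rewards are written back into `history`, so the policy can condition on its own earlier failures.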
Recall that we need an adaptive policy $\pi(\cdot\mid h_t)$ that depends on the historical context $h_t = s_t\oplus r_{0:t}$, instead of only the state $s_t = s_0\oplus a_{0:t}\oplus f_{0:t}$. Prior works have optimized policies that adapt to reward feedback; however, ground-truth rewards are generally unavailable at test time. There are two possible solutions:
In practice, we may introduce special tokens to mark the end of each episode (e.g., after generating $a_2$ and $a_4$ in Equation 2). These markers indicate when the verifier should be called to produce the reward. It may also be beneficial to explicitly train the model's self-verification accuracy as done in previous works.
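The marker mechanism can be sketched minimally as follows. The token name `<|eoe|>` is an assumption for illustration; any reserved special token would do.

```python
EOE = "<|eoe|>"  # hypothetical end-of-episode marker token

def split_episodes(trajectory_text):
    """Split a generated trajectory into episodes at each marker.

    In a real serving loop, each emitted marker would instead trigger
    a verifier call, with the resulting reward appended to the context.
    """
    return [seg for seg in trajectory_text.split(EOE) if seg.strip()]
```

Segmenting on a dedicated token keeps episode boundaries unambiguous for both the scaffold (when to call the verifier) and the training pipeline (where each $r_j$ belongs).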
We may also need to construct continued pre-training or mid-training data that follows the structure of Equation 2. Such data can be generated using scaffolds that enforce trial-and-error processes, or by leveraging naturally occurring error-correction data on the internet, such as chain-of-PRs.
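As one way to picture the scaffolded route, logged attempts at the same prompt can be concatenated into an Equation-2-style training sequence. This is a hedged sketch; the function name, the `reward=` formatting, and the separator token are all assumptions.

```python
def build_trial_sequence(prompt, attempts, sep="<|eoe|>"):
    """Assemble (trajectory, reward) pairs for one prompt into a
    single trial-structured sequence: s_0, then each attempt followed
    by its reward and an end-of-episode separator."""
    parts = [prompt]
    for trajectory, reward in attempts:
        parts.append(trajectory)
        parts.append(f"reward={reward}{sep}")
    return "".join(parts)
```

Ordering attempts chronologically (failures first, eventual success last) is what gives the model supervised exposure to the error-correction pattern it should later exhibit at test time.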
We now formalize why adaptive policies that learn from trial and error are necessary. Please refer to our previous work for more details.
The conventional RL objective in its most general form is