March 2026
We want LLMs to be good continual learners, meaning that they learn to learn from trial and error. This enables emergent agentic capabilities, including error recovery, dynamic tool learning, and better just-in-time retrieval, all via test-time exploration.
Why do we need such behaviors? The fundamental reason is that we do not have complete coverage of the prompt space. If we knew all possible user queries in advance, we could simply apply RL to memorize the optimal trajectories, without the need for meta-learning! This epistemic uncertainty makes the Markov decision process (MDP) partially observable (POMDP). As a result, the optimal policy becomes context-adaptive, $\pi(\cdot\mid h)$, as opposed to the Markovian policy $\pi(\cdot\mid s)$ in RL. Please refer to the Method 2 section for a formal discussion.
$$ s_0 \rightarrow a_0 \rightarrow f_0 \rightarrow a_1 \rightarrow f_1 \rightarrow a_2 \rightarrow r_0 \tag{1} $$
This formula describes current LLM workflows. Here, $s_0$ is the prompt, $a_t$ is the LLM response that can involve tool calling, $f_t$ is the execution feedback from those tools, and $r_0$ is the outcome reward ($0$ or $1$).
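The workflow in Equation 1 can be sketched as a simple agent loop. This is a minimal illustration, not a real implementation: `llm`, `run_tool`, and `verifier` are hypothetical stand-ins for a chat model, a tool executor, and an outcome checker.

```python
def run_episode(llm, run_tool, verifier, prompt, max_steps=3):
    """One episode of Equation 1: s_0 -> a_0 -> f_0 -> ... -> r_0."""
    context = [prompt]                 # s_0
    for _ in range(max_steps):
        action = llm(context)          # a_t, may contain a tool call
        context.append(action)
        if action.get("tool_call") is None:
            break                      # final answer, no further feedback
        feedback = run_tool(action["tool_call"])  # f_t
        context.append(feedback)
    reward = verifier(context)         # outcome reward r_0 in {0, 1}
    return context, reward
```

Note that the reward is produced once, at the very end: nothing in this loop lets the model react to $r_0$, which is exactly what the multi-episode formulation below changes.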
To train good continual learners, we instead want the following workflow:
$$ \underbrace{s_0 \rightarrow a_0 \rightarrow f_0 \rightarrow a_1 \rightarrow f_1 \rightarrow a_2 \rightarrow r_0}_{\text{episode 0}} \textcolor{blue}{\rightarrow} \underbrace{\textcolor{blue}{a_3 \rightarrow f_3 \rightarrow a_4 \rightarrow r_1}}_{\text{episode 1}}\textcolor{blue}{\rightarrow}\underbrace{\textcolor{blue}{\cdots \rightarrow r_{k-1}}}_{\text{episode }k-1} \qquad\tag{2} $$
Here we extend Equation 1 by defining a trial that contains multiple episodes. This setup encourages the model to perform context-gathering behaviors through exploration. The procedure is a standard (in-context) meta-RL formulation, with objective
$$ \max_{\pi(\cdot\mid h)} \mathbb{E}_{a\sim\pi(\cdot\mid h)}\Biggl[\sum_{j=0}^{k-1} \gamma^{j}r_j\Biggr]\qquad\tag{3} $$
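A trial that optimizes the objective in Equation 3 can be sketched as follows. This is a hedged illustration under simplifying assumptions: `llm` and `verifier` are hypothetical stand-ins, and a real setup would also interleave tool feedback $f_t$ inside each episode as in Equation 2.

```python
def run_trial(llm, verifier, prompt, k=3, gamma=0.9):
    """Run up to k episodes; feed each reward r_j back into the context."""
    history = [prompt]            # h grows across the whole trial
    rewards = []
    for _ in range(k):
        answer = llm(history)     # the episode's final action
        history.append(answer)
        r = verifier(answer)      # r_j, appended so later episodes see it
        history.append({"reward": r})
        rewards.append(r)
        if r == 1:                # stop early once solved
            break
    # Discounted trial return from Equation 3: sum_j gamma^j * r_j.
    ret = sum(gamma**j * r for j, r in enumerate(rewards))
    return history, rewards, ret
```

The key difference from the single-episode loop is that rewards are written back into `history`, so the policy can condition on its own earlier failures.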
Recall that we need an adaptive policy $\pi(\cdot\mid h_t)$ that depends on the historical context $h_t = s_t\oplus r_{0:t}$, instead of only the state $s_t = s_0\oplus a_{0:t}\oplus f_{0:t}$. Prior works have optimized policies that adapt to reward feedback; however, ground-truth rewards are generally unavailable at test time. There are two possible solutions:
In practice, we may introduce special tokens to mark the end of each episode (e.g., after generating $a_2$ and $a_4$ in Equation 2). These markers indicate when the verifier should be called to produce the reward. It may also be beneficial to explicitly train the model's self-verification accuracy as done in previous works.
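The marker mechanism can be sketched minimally as follows. The token name `<|eoe|>` is an assumption for illustration; any reserved special token would do.

```python
EOE = "<|eoe|>"  # hypothetical end-of-episode marker token

def split_episodes(trajectory_text):
    """Split a generated trajectory into episodes at each marker.

    In a real serving loop, each emitted marker would instead trigger
    a verifier call, with the resulting reward appended to the context.
    """
    return [seg for seg in trajectory_text.split(EOE) if seg.strip()]
```

Segmenting on a dedicated token keeps episode boundaries unambiguous for both the scaffold (when to call the verifier) and the training pipeline (where each $r_j$ belongs).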
We may also need to construct continued pre-training or mid-training data that follows the structure of Equation 2. Such data can be generated using scaffolds that enforce trial-and-error processes, or by leveraging naturally occurring error-correction data on the internet, such as chain-of-PRs.
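As one way to picture the scaffolded route, logged attempts at the same prompt can be concatenated into an Equation-2-style training sequence. This is a hedged sketch; the function name, the `reward=` formatting, and the separator token are all assumptions.

```python
def build_trial_sequence(prompt, attempts, sep="<|eoe|>"):
    """Assemble (trajectory, reward) pairs for one prompt into a
    single trial-structured sequence: s_0, then each attempt followed
    by its reward and an end-of-episode separator."""
    parts = [prompt]
    for trajectory, reward in attempts:
        parts.append(trajectory)
        parts.append(f"reward={reward}{sep}")
    return "".join(parts)
```

Ordering attempts chronologically (failures first, eventual success last) is what gives the model supervised exposure to the error-correction pattern it should later exhibit at test time.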
We now formalize why adaptive policies that learn from trial and error are necessary. Please refer to our previous work for more details.
The conventional RL objective in its most general form is