$\scriptsize \textcolor{gray}{\underline{\rm{Weijia\ Xu}} }$ $\scriptsize \textcolor{gray}{\underline{\rm{Alessandro\ Sordoni} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Chandan\ Singh} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Zelalem\ Gero} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Michel\ Galley} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Eric\ Yuan} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Jianfeng\ Gao} }}$

$\scriptsize \rm{First\ published\ on\ Feb\ 24,\ 2026}$

<aside>

TL;DR

We introduce Evolving Library, a test-time continual learning framework that enables large language models to accumulate, reuse, and evolve knowledge and skills across problem instances at test time. The method maintains a shared library of extensible knowledge abstractions, extracted from the model’s own inference trajectories and dynamically weighted as those trajectories accumulate at inference time. Experiments on math, coding, and multi-turn agentic tasks show that Evolving Library consistently improves performance with increasing test-time compute and outperforms strong test-time scaling and memory-based baselines, demonstrating effective test-time learning across instances.

</aside>

Figure 1: Overview of the algorithm.

Figure 2: We introduce Evolving Library, a library of knowledge abstractions (e.g. functions or natural language statements) dynamically updated based on the rollout trajectories. It enables continuous learning from the rollout trajectories without model fine-tuning (Library Update). Each abstraction in the library is associated with a weight based on 1) the information gain (IG) which measures its usefulness in solving the problem from which the knowledge is extracted; and 2) future IG which estimates its potential in inspiring more useful knowledge to be generated in the future.

Test‑Time Continual Learning

Test-time scaling has proven to be a powerful way to boost large language model (LLM) performance at inference time. Existing approaches typically rely on either sampling multiple reasoning trajectories for the same problem and aggregating their outcomes (Wang et al., 2023; Weng et al., 2023) or iteratively refining a solution for each individual problem through repeated self-correction or search (Muennighoff et al., 2025; Zhang et al., 2025; Venkatraman et al., 2025). While effective, these methods treat each problem in isolation: the knowledge uncovered during the search process is discarded after inference, preventing the model from inducing generalizable knowledge across problems. As a result, similar errors are repeatedly made across different problems, leading to substantial inefficiency.

Humans, by contrast, can learn from past experience and encode that knowledge in abstract, reusable, and extensible forms. This allows them to accumulate and refine knowledge over time, enabling adaptation to novel situations with far less repeated effort.

Inspired by this, we propose Evolving Library, a new framework that enables LLMs to induce and develop extensible knowledge abstractions that are useful across problem instances. As shown in Figure 1, we build an evolving library of knowledge abstracted from the model’s own rollout trajectories across tasks. The evolving library contains two types of knowledge: 1) modular skills that can be reused or adapted to solve new tasks (e.g. functions from coding tasks and sub-problems and their solution trajectories from reasoning tasks); 2) reflective insights learned from past trajectories with potential mistakes.

When faced with a new problem, the framework samples knowledge abstractions of both types from the library (separately) and adapts them to solve the current problem. It then updates the library with new abstractions extracted from the rollout trajectories on this problem.

For the library to evolve effectively over time, it needs to extend existing abstractions into more advanced, generalized ones: new abstractions extracted from different problems are merged into similar existing ones, so that the merged abstractions generalize across problem instances. Furthermore, to encourage the generation of abstractions that can be extended in this way, each item in the library is assigned a weight based on how well it helps solve the current problem (i.e. Information Gain) and how likely it is to inspire more useful knowledge for solving other problems (i.e. Future Information Gain), as shown in Figure 2. In the algorithm, these weights are dynamically updated at step 2 (Figure 1). We describe how they are computed in the next section.
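
To make the loop concrete, here is a minimal sketch of per-problem test-time learning. The `library` and `llm` interfaces (`sample`, `merge_or_add`, `update_weights`, `solve`, `extract_abstractions`) are hypothetical placeholders for the steps in Figure 1, not the authors' actual API.

```python
def run_test_time_learning(problems, library, llm, k=4):
    """Solve a stream of problems while the library evolves across instances.

    NOTE: `library` and `llm` are assumed interfaces standing in for the steps
    in Figure 1 (sampling, solving, extraction, merging, re-weighting).
    """
    answers = []
    for problem in problems:
        # Step 1: sample skills and reflective insights separately,
        # weighted by their IG / Future IG scores.
        context = library.sample(kind="skill", k=k) + library.sample(kind="insight", k=k)

        # Adapt the sampled abstractions to the current problem.
        trajectory = llm.solve(problem, context=context)
        answers.append(trajectory.answer)

        # Library Update: extract new abstractions from the rollout and merge
        # similar ones into existing items so they generalize across problems.
        for abstraction in llm.extract_abstractions(trajectory):
            library.merge_or_add(abstraction)

        # Step 2: re-weight items by Information Gain and Future Information Gain.
        library.update_weights(problem, llm)
    return answers
```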

Evolving Library

The library is represented as a weighted collection of knowledge abstractions of two types: 1) modular skills that can be reused or adapted to solve new tasks (e.g. functions from coding tasks, sub-problems and their solution trajectories from reasoning tasks, and sub-goals with workflows to accomplish them from agentic tasks); 2) reflective insights learned from past trajectories with potential mistakes.

Each item in the library is associated with a weight that is dynamically updated based on 1) the Information Gain (IG), which measures its usefulness in solving the problem from which the knowledge is extracted (computed only for skill abstractions), and 2) the Future Information Gain (Future IG), which estimates its potential to inspire more useful knowledge in the future.
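
As a rough illustration, a library item might be represented as below. The field names and the way IG and Future IG are combined into a single weight are our own assumptions; the text only says the weight is "based on" the two quantities.

```python
from dataclasses import dataclass


@dataclass
class Abstraction:
    """One library item (field names and weight combination are assumptions)."""
    content: str            # a function, sub-problem + solution, workflow, or insight text
    kind: str               # "skill" or "insight"
    ig: float = 0.0         # Information Gain on the problem it was extracted from (skills only)
    future_ig: float = 0.0  # estimated Future Information Gain

    @property
    def weight(self) -> float:
        # A simple sum is one plausible choice, not necessarily the authors' exact formula.
        return self.ig + self.future_ig
```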

The Information Gain of an abstraction z given a task x (with target answer y) and the current library $\mathcal{K}$ is measured by how much better a model $\Phi$ can solve x given $\mathcal{K}$ and z, compared to its performance given just $\mathcal{K}$ without z:

$$ IG(z|(x,y),\mathcal{K})=\log P_\Phi(y|x,\mathcal{K} \cup z)-\log P_\Phi(y|x,\mathcal{K}) $$

Intuitively, IG measures how useful z is for solving x beyond what the current library $\mathcal{K}$ alone provides. For example, if adding z raises the estimated probability of solving x from 0.25 to 0.5, then IG = log 0.5 - log 0.25 = log 2 ≈ 0.69.

In practice, we estimate $P_\Phi(y|x,\mathcal{K} \cup z)$ by the estimated accuracy of solutions generated by the LLM given the task x, the abstraction z, and K abstractions sampled from the library $\mathcal{K}$. The accuracy of a solution is estimated either by running it against synthetic unit tests generated by the LLM itself, or by the LLM’s own judgment (e.g. via its majority answer or a separate LLM call). We further estimate the baseline score $P_\Phi(y|x,\mathcal{K})$ by the average estimated accuracy of solutions generated with 0 to K abstractions sampled randomly from the library $\mathcal{K}$.
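
A minimal sketch of this Monte Carlo estimate, assuming a hypothetical `llm.solve_and_score(problem, context)` helper that generates a solution conditioned on `context` and returns its estimated accuracy in [0, 1] (via self-generated unit tests, majority voting, or an LLM judge):

```python
import math
import random


def estimate_ig(problem, z, library_items, llm, k=4, n_samples=4, eps=1e-6):
    """Monte Carlo estimate of IG(z | (x, y), K) as described above."""
    def avg_accuracy(contexts):
        scores = [llm.solve_and_score(problem, context=ctx) for ctx in contexts]
        return sum(scores) / len(scores)

    # P(y | x, K ∪ z): condition on z plus K abstractions sampled from the library.
    contexts_with_z = [
        [z] + random.sample(library_items, min(k, len(library_items)))
        for _ in range(n_samples)
    ]
    p_with_z = avg_accuracy(contexts_with_z)

    # P(y | x, K): baseline with 0 to K abstractions sampled randomly (without z).
    contexts_without_z = [
        random.sample(library_items, random.randint(0, min(k, len(library_items))))
        for _ in range(n_samples)
    ]
    p_without_z = avg_accuracy(contexts_without_z)

    # eps avoids log(0) when every sampled solution is judged incorrect.
    return math.log(p_with_z + eps) - math.log(p_without_z + eps)
```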

Furthermore, to encourage the discovery of extensible knowledge that can evolve into more advanced abstractions, we introduce the Future Information Gain (Future IG). Given a randomly sampled task (x’, y’), the Future IG of z measures how much better the model $\Phi$ can solve x when given $\mathcal{K}$, z, and a new abstraction z’ evolved from $\mathcal{K} \cup z$ for x’, compared to its performance given $\mathcal{K}$ and a new abstraction z’’ evolved from $\mathcal{K}$ alone (without z):

$$ FutureIG(z|\mathcal{K})=\mathbb{E}_{(x',y')} \, \mathbb{E}_{z'\sim P_\Phi(\cdot|x',\mathcal{K} \cup z),\, z''\sim P_\Phi(\cdot|x',\mathcal{K})} \left[\log P_\Phi(y|x,\mathcal{K} \cup z \cup z')-\log P_\Phi(y|x,\mathcal{K} \cup z'')\right] $$
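
Mirroring the expectation above, a rough Monte Carlo estimate might look as follows. It reuses the assumed `llm.solve_and_score` helper and adds a hypothetical `llm.evolve(task, context)` call that generates a new abstraction from the given context for the sampled task; neither is the authors' actual API.

```python
import math
import random


def estimate_future_ig(problem, z, library_items, candidate_tasks, llm, n_tasks=2, eps=1e-6):
    """Monte Carlo estimate of FutureIG(z | K), following the formula above."""
    gains = []
    for future_task in random.sample(candidate_tasks, min(n_tasks, len(candidate_tasks))):
        # z' ~ P(· | x', K ∪ z): a new abstraction evolved with z available.
        z_prime = llm.evolve(future_task, context=library_items + [z])
        # z'' ~ P(· | x', K): a new abstraction evolved without z.
        z_second = llm.evolve(future_task, context=library_items)

        # Compare how well the original problem x is solved with each evolved context.
        p_with = llm.solve_and_score(problem, context=library_items + [z, z_prime])
        p_without = llm.solve_and_score(problem, context=library_items + [z_second])
        gains.append(math.log(p_with + eps) - math.log(p_without + eps))
    return sum(gains) / len(gains) if gains else 0.0
```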