$\scriptsize \textcolor{gray}{\underline{\rm{Weijia\ Xu}} }$ $\scriptsize \textcolor{gray}{\underline{\rm{Alessandro\ Sordoni} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Chandan\ Singh} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Zelalem\ Gero} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Michel\ Galley} }}$ $\scriptsize \textcolor{gray}{\underline{\rm{Jianfeng\ Gao} }}$
$\scriptsize \rm{First\ published\ on\ Feb\ 24,\ 2026}$
<aside>
TL;DR
We introduce Evolving Library, a test-time continual learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without fine-tuning. The method maintains a shared library of extensible knowledge abstractions extracted and dynamically weighted based on the model’s own inference trajectories. Experiments on abstract reasoning, question answering, and coding benchmarks show that Evolving Library consistently improves performance with increasing test-time compute and outperforms best-of‑N sampling and per-instance knowledge accumulation baselines, demonstrating effective test-time learning across instances.
</aside>

Figure 1: Overview of the algorithm. Starting from an empty library, it runs through steps 1-3 iteratively until the maximum number of iterations is reached. At step 1, it randomly samples a task from the task pool and, if the library is non-empty, a few knowledge abstractions from the library according to their weights. Given the task description and the sampled abstractions, it samples M solutions from the LLM. Each solution is then assigned an estimated score, either by running it through synthetic unit tests or by majority voting. At step 2, it adds the abstractions extracted from the newly generated solutions, together with their Information Gain (IG) scores (based on the estimated scores of the solutions), to the library. It also updates the Future IG scores of the abstractions that were sampled into the prompt for solving this task. Optionally, at step 3, new tasks can be created by prompting the LLM with knowledge abstractions sampled from the library to further expand on the high-value abstractions in the library.
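The loop in Figure 1 can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: `sample_solutions`, `estimate_score`, and `extract_abstractions` are hypothetical stand-ins for the LLM calls and scoring described in the caption, and the weight updates are deliberately crude (new abstractions are credited with their solution's score; sampled abstractions are credited with the best score they helped produce, a rough stand-in for the Future IG update).

```python
import random

# Hypothetical stand-ins for the LLM calls in Figure 1.
def sample_solutions(task, abstractions, m):
    return [f"solution_{i}_for_{task}" for i in range(m)]

def estimate_score(solution):
    # In the paper: synthetic unit tests or majority voting.
    return random.random()

def extract_abstractions(solution):
    return [f"abstraction_from_{solution}"]

def run_evolving_library(tasks, max_iters=10, m=4, k=2):
    library = {}  # abstraction -> accumulated weight
    for _ in range(max_iters):
        # Step 1: sample a task and up to k abstractions by weight.
        task = random.choice(tasks)
        sampled = []
        if library:
            items = list(library)
            weights = [library[a] + 1e-6 for a in items]
            sampled = random.choices(items, weights=weights,
                                     k=min(k, len(items)))
        solutions = sample_solutions(task, sampled, m)
        scores = [estimate_score(s) for s in solutions]
        # Step 2: add new abstractions, credited with their solution's
        # estimated score; credit the sampled abstractions as a crude
        # proxy for the Future IG update.
        for solution, score in zip(solutions, scores):
            for a in extract_abstractions(solution):
                library[a] = library.get(a, 0.0) + score
        for a in sampled:
            library[a] += max(scores)
    return library
```

Step 3 (task creation) is omitted here; it would add entries to `tasks` by prompting the model with high-weight abstractions.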

Figure 2: We introduce Evolving Library, a library of knowledge abstractions (e.g. functions or natural language statements) dynamically updated based on the rollout trajectories. It enables continuous learning from the rollout trajectories without model fine-tuning (Library Update). Each abstraction in the library is associated with a weight based on 1) the information gain (IG) which measures its usefulness in solving the problem from which the knowledge is extracted; and 2) future IG which estimates its potential in inspiring more useful knowledge to be generated in the future. The library can be further expanded through Task Creation, where the model generates new problems grounded by the most useful, extensible abstractions in the library and updates the library with new abstractions extracted from the solution trajectories.
Test-time scaling has proven to be a powerful way to boost large language model (LLM) performance at inference time. Existing approaches typically rely on either sampling multiple reasoning trajectories for the same problem and aggregating their outcomes (Wang et al., 2023; Weng et al., 2023) or iteratively refining a solution for each individual problem through repeated self-correction or search (Muennighoff et al., 2025; Zhang et al., 2025; Venkatraman et al., 2025). While effective, these methods treat each problem in isolation: the knowledge uncovered during the search process is discarded after inference, preventing the model from inducing generalizable knowledge across problems. As a result, similar errors are repeatedly made across different problems, leading to substantial inefficiency.
Humans, by contrast, can learn from past experience and encode that knowledge in abstract, reusable, and extensible forms. This allows them to accumulate and refine knowledge over time, enabling adaptation to novel situations with far less repeated effort.
Inspired by this, we propose Evolving Library, a new framework that enables LLMs to induce and develop extensible knowledge abstractions that are useful across problem instances. As shown in Figure 1, we build an evolving library of knowledge abstracted from the model’s own rollout trajectories across tasks. When faced with a new problem, it samples a few knowledge abstractions from the library and adapts them to solve the current problem. It then updates the library with new abstractions extracted from the rollout trajectories on this problem.
For the library to expand effectively over time, it needs to upweight the abstractions that are more useful for solving tasks and more extensible into more advanced ones. Thus, for each item in the library, we compute its weight based on how well it helps solve the current problem (i.e., Information Gain) and how likely it is to inspire more useful knowledge for solving other problems (i.e., Future Information Gain), as shown in Figure 2. In the algorithm, these weights are dynamically updated at step 2 (Figure 1). We describe how the weights are computed in the next section.
Beyond task-driven updates, Evolving Library further supports self-play-based evolution. By recombining existing knowledge in the library, the model generates new solvable tasks and searches for increasingly efficient solutions. Over time, this self-play process enables the model to build a progressively richer and more efficient repertoire of reusable knowledge.
The library is represented as a weighted collection of knowledge abstractions, where each abstraction encapsulates a piece of knowledge. In coding tasks, an abstraction may correspond to a function, while in knowledge-based reasoning tasks, it may take the form of a piece of knowledge stated in natural language.
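A weighted collection of abstractions with weight-proportional sampling might look like the following sketch. The field names (`ig`, `future_ig`) and the additive combination of the two scores into a single sampling weight are our own illustrative choices, not a detail confirmed by the text.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Abstraction:
    content: str            # e.g. a function, or a natural-language statement
    ig: float = 0.0         # Information Gain from the originating problem
    future_ig: float = 0.0  # estimated potential to inspire new knowledge

    @property
    def weight(self):
        # Illustrative: combine the two scores additively.
        return self.ig + self.future_ig

@dataclass
class EvolvingLibrary:
    items: list = field(default_factory=list)

    def add(self, content, ig):
        self.items.append(Abstraction(content, ig=ig))

    def sample(self, k):
        """Sample up to k abstractions, probability proportional to weight."""
        if not self.items:
            return []
        weights = [max(a.weight, 1e-6) for a in self.items]
        return random.choices(self.items, weights=weights,
                              k=min(k, len(self.items)))
```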
Each item in the library is associated with a weight that is dynamically updated based on 1) the Information Gain (IG), which measures its usefulness in solving the problem from which the knowledge is extracted; and 2) the Future Information Gain (Future IG), which estimates its potential to inspire more useful knowledge in the future.
The Information Gain of a knowledge z given a task x (with target answer y) and the current library $\mathcal{K}$ is measured by how much better a model $\Phi$ can solve x given z and $\mathcal{K}$, compared to its performance given just $\mathcal{K}$ without z:
$$ IG(z|(x,y),\mathcal{K})=\log P_\Phi(y|x,\mathcal{K} \cup z)-\log P_\Phi(y|x,\mathcal{K})-\lambda Cost(z|x,\mathcal{K}) $$
We constrain z by $Cost(z|x,\mathcal{K})$ to push the model to discover more concise knowledge. The cost can be computed based on the length or complexity of z. Intuitively, IG measures the usefulness of z in solving x given the cost constraint.
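Given estimated solve probabilities, the IG formula above is a one-liner. The helper below is a sketch: `p_with` and `p_without` stand for the estimated $P_\Phi(y|x,\mathcal{K} \cup z)$ and $P_\Phi(y|x,\mathcal{K})$, the token-length cost and the value of $\lambda$ are hypothetical choices.

```python
import math

def information_gain(p_with, p_without, cost, lam=0.01):
    """IG(z | (x, y), K) = log P(y | x, K u z) - log P(y | x, K) - lam * Cost.

    p_with / p_without: estimated solve probabilities with and without z.
    cost: e.g. the token length of z; lam: a hypothetical cost weight.
    """
    return math.log(p_with) - math.log(p_without) - lam * cost

# Example: z raises estimated accuracy from 0.5 to 0.8 at a cost of 20 tokens.
ig = information_gain(p_with=0.8, p_without=0.5, cost=20)
```

Note that IG can be negative: a long abstraction that barely improves accuracy is penalized, pushing the library toward concise knowledge.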
In practice, we estimate $P_\Phi(y|x,\mathcal{K} \cup z)$ by the estimated accuracy of solutions generated by the LLM given the task x, abstraction z and K abstractions sampled from the library $\mathcal{K}$. The accuracy of a solution is estimated by either running it through the synthetic unit tests generated by the LLM itself or matching it with the most consistent answer generated by the LLM given the current library (Wang et al., 2023). We further estimate the baseline score $P_\Phi(y|x,\mathcal{K})$ by the average estimated accuracy of solutions generated with 0-K abstractions sampled randomly from the library $\mathcal{K}$.
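The majority-voting variant of the accuracy estimate follows self-consistency (Wang et al., 2023): each sampled answer is scored by agreement with the most frequent answer. A simplified sketch, where exact string match stands in for whatever answer-equivalence check is actually used:

```python
from collections import Counter

def majority_vote_scores(answers):
    """Score each sampled answer as 1.0 if it matches the most frequent
    (majority) answer, else 0.0. Exact string match is a simplification."""
    counts = Counter(answers)
    majority, _ = counts.most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

scores = majority_vote_scores(["42", "42", "41", "42"])
```

The same per-solution scores can then be averaged to estimate $P_\Phi(y|x,\mathcal{K} \cup z)$, and averaged over prompts with 0 to K randomly sampled abstractions to estimate the baseline $P_\Phi(y|x,\mathcal{K})$.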
Furthermore, to encourage the discovery of extensible knowledge that can be evolved into more advanced knowledge, we introduce Future Information Gain (Future IG). The Future IG of z is measured by the expected IG of new knowledge $z'$ generated based on $\mathcal{K} \cup z$ given a randomly sampled task $(x', y')$:
$$ FutureIG(z|\mathcal{K})=\mathbb{E}_{(x',y')}\, \mathbb{E}_{z'\sim P_\Phi(\cdot|x',\mathcal{K} \cup z)}\, IG(z'|(x',y'),\mathcal{K} \cup z) $$
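In practice the two expectations can only be approximated by Monte Carlo sampling. The sketch below shows that structure; all of the callable arguments (`sample_task`, `propose_knowledge`, `information_gain`) are hypothetical hooks onto the LLM and the IG estimator, not part of a published API.

```python
def future_ig(z, library, sample_task, propose_knowledge, information_gain,
              n_samples=8):
    """Monte Carlo estimate of Future IG: average the IG of new knowledge z'
    proposed from K u {z} over randomly sampled tasks (x', y').

    sample_task():                    draws a task (x', y')
    propose_knowledge(task, K):       samples z' ~ P(. | x', K u {z})
    information_gain(z', task, K):    IG of z' on that task
    """
    augmented = library + [z]
    total = 0.0
    for _ in range(n_samples):
        task = sample_task()
        z_new = propose_knowledge(task, augmented)
        total += information_gain(z_new, task, augmented)
    return total / n_samples
```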