Author: Jiawei Wang. First published on Feb 9, 2026. Work done as an intern at Seed.
<aside> 💡
TL;DR
```bibtex
@online{wang2026experience,
  title  = {From Amnesia to Mastery: How Agents Learn Skills In-Context},
  author = {Wang, Jiawei},
  year   = {2026},
  month  = feb,
  url    = {https://www.notion.so/From-Amnesia-to-Mastery-How-Agents-Learn-Skills-In-Context-2ef968f937bc8068945fca6c69e659cf?source=copy_link}
}
```
</aside>
LLM agents are increasingly deployed in complex, multi-step environments such as spreadsheets, coding sandboxes, and web automation. Despite impressive zero-shot capabilities, these agents repeatedly suffer from a key limitation: they do not truly learn from experience.
In practice, an agent may solve a task correctly, encounter a similar task later, and still repeat the same trial-and-error process. Existing mechanisms—long-context memory, retrieval-augmented generation (RAG), or prompt engineering—primarily enable recall, not learning. As a result, agent performance remains brittle, inefficient, and highly sensitive to stochastic variation.
Updating model parameters via continual learning is a natural alternative, but it introduces substantial challenges including catastrophic forgetting, delayed feedback loops, and high operational costs. This motivates an intermediate paradigm:
Can LLM agents acquire reusable skills and strategies in context, without modifying their weights?
This question has sparked a wave of recent research. Works such as Learning on the Job[1], ReasoningBank[2], Evo-Memory[3], and FLEX[4] have pioneered the idea of abstracting agent trajectories into memory. These studies demonstrate that agents can indeed improve by "memorizing" past successes. However, a gap remains between raw recall and structured skill acquisition. Existing approaches often treat experience as a flat collection of trajectories or generic reflections, which can be noisy and hard to generalize to even slightly different contexts. Furthermore, few studies quantify the efficiency of learning: does the agent actually become "smarter" and faster, or does it merely stumble upon the answer with more guidance?
We propose Experience-Driven Learning (EDL) to address this question. EDL distinguishes itself by organizing experience into a fine-grained taxonomy—from atomic tool usage to high-level negative constraints—and rigorously filtering for quality. Through extensive experiments on SpreadsheetBench Verified[5], we show that structured experience does not just improve success rates; it significantly reduces execution steps, proving that the agent is learning efficient strategies rather than just memorizing answers.
In EDL, we do not treat experience as a flat log of history. Instead, we structure it into a fine-grained taxonomy that captures different levels of abstraction and serves as a latent strategy representation. By abstracting away environmental noise, such as specific cell addresses, the taxonomy preserves the intrinsic structure of a strategy, enabling robust semantic alignment and knowledge transfer across diverse task distributions.
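To make this concrete, here is a minimal sketch of the kind of noise abstraction described above, assuming trajectory steps are logged as plain text. The regex patterns and placeholder tokens are illustrative assumptions, not the exact rules used in EDL:

```python
import re

def abstract_trajectory(step: str) -> str:
    """Replace environment-specific details with generic placeholders so the
    remaining text captures the strategy, not the instance."""
    # Cell ranges like "A1:C20" first, then single cells like "B12"
    step = re.sub(r"\b[A-Z]{1,3}\d+:[A-Z]{1,3}\d+\b", "<RANGE>", step)
    step = re.sub(r"\b[A-Z]{1,3}\d+\b", "<CELL>", step)
    # Concrete workbook filenames -> generic token
    step = re.sub(r"\S+\.xlsx\b", "<FILE>", step)
    return step

print(abstract_trajectory("Sum B2:B20 of report.xlsx into B21"))
# -> Sum <RANGE> of <FILE> into <CELL>
```

Applying the range pattern before the single-cell pattern matters; otherwise "A1:C20" would be split into two `<CELL>` tokens and the range semantics would be lost.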
Through iterative experimentation, we identified four complementary types of experience:
| Experience Type | Definition | Why it matters |
|---|---|---|
| 🛠️ Atomic Tool | Fine-grained API usage patterns. | Handles syntax nuances (e.g., openpyxl params). |
| 📋 Procedural Workflow | SOPs for common sub-tasks. | Prevents skipping steps in complex pipelines. |
| 🧠 Meta Strategy | High-level reasoning principles. | Guides how to think and decompose problems. |
| 🚧 Negative Constraint | Explicit warnings on what NOT to do. | Prunes dead-ends based on past failures. |
This taxonomy allows experience to encode not only what to do, but also what not to do, and at what level of abstraction. Examples can be found in Appendix A.
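As a rough illustration of how the four experience types might be represented in code, consider the sketch below. The class and field names are our own assumptions for exposition, not EDL's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class ExperienceType(Enum):
    ATOMIC_TOOL = "atomic_tool"              # fine-grained API usage patterns
    PROCEDURAL_WORKFLOW = "workflow"         # SOPs for common sub-tasks
    META_STRATEGY = "meta_strategy"          # high-level reasoning principles
    NEGATIVE_CONSTRAINT = "negative"         # explicit warnings on what NOT to do

@dataclass
class Experience:
    type: ExperienceType
    content: str       # the abstracted lesson, free of instance-specific details
    source_task: str   # identifier of the task the experience was mined from

# Example of a negative constraint mined from a failed trajectory
# (task identifier and wording are hypothetical):
exp = Experience(
    type=ExperienceType.NEGATIVE_CONSTRAINT,
    content="Do not overwrite formula cells when filling a column with values.",
    source_task="spreadsheet_fill_043",
)
```

Typing the experiences explicitly lets a retrieval layer filter by abstraction level, for example injecting only meta strategies and negative constraints for unfamiliar tasks.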
<aside> 💡
Why this matters: This taxonomy allows our system to encode both positive guidance (what to do) and negative boundaries (what to avoid). As we show later in the experiments, the Negative Constraints are particularly critical for generalizing to unseen, difficult tasks.
</aside>
As shown in Figure 1, experiences are mined from N sampled trajectories of a baseline agent for each task. Rather than treating these trajectories as flat logs, we employ a multi-stage pipeline to extract structured knowledge based on outcome quality:
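The outcome-based routing in this pipeline can be sketched as follows, under the assumption that each trajectory records its steps and a success flag; `stub_extract` stands in for the LLM-backed extraction stage, and the mode names are hypothetical:

```python
def mine_experiences(trajectories, extract):
    """Route each sampled trajectory to an extraction mode based on outcome:
    successes yield positive knowledge (tools, workflows, strategies);
    failures yield negative constraints."""
    experiences = []
    for t in trajectories:
        mode = "positive" if t["success"] else "negative"
        experiences.extend(extract(mode, t["steps"]))
    return experiences

# Stub standing in for the LLM extraction call.
def stub_extract(mode, steps):
    if mode == "positive":
        return [f"workflow: {steps[0]}"]
    return [f"constraint: avoid '{steps[-1]}'"]

trajs = [
    {"success": True,  "steps": ["open sheet", "sum column"]},
    {"success": False, "steps": ["open sheet", "overwrite formula"]},
]
print(mine_experiences(trajs, stub_extract))
```

In the full pipeline, the extracted candidates would then pass through the quality filtering stage before being added to the experience store.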