I've built multiple AI agents, but understanding their design principles always felt fragmented. Fortunately, Stanford's CS329A changed that—it traces the evolution of agent architectures from first principles, showing how each innovation emerged from solving specific limitations. After reading all papers from its reading list, I've reorganized the key concepts here in a more narrative-driven format: from ReAct to Reflexion, from test-time scaling to train-time RL, from memory-as-text to memory-as-representation.

1. Foundation: Evolution of Agent Architectures

ReAct: The Starting Point

The simplest agent follows a three-step loop: Thought → Action → Observation. This is ReAct.

Here's how it works in practice. You give the agent a few-shot prompt showing examples of this loop. At each step:

  1. Thought: The agent reasons about what to do next (task decomposition, intention detection, etc.)
  2. Action: It executes something—calls a tool, generates content, or queries a database
  3. Observation: The system observes the result and feeds it back as a new prompt

Then the loop repeats until the task is done.
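The loop above can be sketched in a few lines. This is a toy illustration, not any library's API: `run_react`, `fake_llm`, and `TOOLS` are hypothetical names, and the "LLM" is a scripted stub so the example runs standalone.

```python
# Minimal ReAct-style loop with a stubbed "LLM" and one toy tool.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool for the demo only
}

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model: scripted replies based on the prompt so far.
    if "Observation: 42" in prompt:
        return "Thought: I have the answer.\nAction: finish[42]"
    return "Thought: I should compute 6 * 7.\nAction: calculator[6 * 7]"

def run_react(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = fake_llm(prompt)            # 1. Thought + Action from the model
        prompt += reply + "\n"
        action = reply.split("Action: ")[-1]
        name, arg = action.split("[", 1)
        arg = arg.rstrip("]")
        if name == "finish":                # terminal action: return the answer
            return arg
        obs = TOOLS[name](arg)              # 2. Action: execute the tool
        prompt += f"Observation: {obs}\n"   # 3. Observation: feed result back
    return "gave up"
```

Note how the observation is simply appended to the prompt and the model is called again — the entire loop state lives in the growing prompt text.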

The limitation? ReAct is terrible at error correction. If it makes a mistake at any step, that mistake is carried forward into every subsequent loop. It has no mechanism to look back and ask why something didn't work, or whether to try a different approach.

Side note: LangChain has a variant of ReAct called "plan-and-execute" where the loop is different—first the agent drafts a plan, then executes multiple steps, then gives all results back to the planner to decide whether to continue or declare completion. It's still forward-only though.

Evolution to Reflexion

Reflexion fixes ReAct's biggest weakness by adding a feedback loop. Four components work together:

  1. Actor: Takes actions (can be a ReAct agent or CoT-style reasoning agent)
  2. Evaluator: Generates reward signals—a score, binary correct/incorrect, etc. (LLM-as-judge or rule-based)
  3. Self-reflection model: When the task fails, this analyzes the trajectory + evaluator signal + memory to generate textual reflection ("why did this fail? how to improve?")
  4. Memory: Stores reflections for future use

The workflow is as follows:

Actor uses short-term + long-term memory as context
  ↓
Takes action
  ↓
Evaluator judges the result
  ↓
If failure: Reflection model writes analysis → stored in long-term memory
  ↓
Next episode uses this reflection as context
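The episode loop can be sketched as follows. Everything here is illustrative: a real actor would be a ReAct or CoT agent and the evaluator might be an LLM-as-judge; in this toy version the actor "succeeds" only once a relevant reflection is in memory, just to show the feedback loop closing.

```python
# Toy Reflexion loop: actor, evaluator, self-reflection model, and a
# long-term memory of textual reflections. All names are hypothetical.

def actor(task, reflections):
    # Stand-in actor: succeeds only after a reflection about this task exists.
    if any(task in r for r in reflections):
        return "correct answer"
    return "wrong answer"

def evaluator(output):
    # Reward signal: binary correct/incorrect (could be LLM-as-judge or rules).
    return output == "correct answer"

def reflect(task, output):
    # Self-reflection model: turn the failed trajectory into textual analysis.
    return f"On '{task}' I produced '{output}'; next time, check the premise."

def run_reflexion(task, max_episodes=3):
    memory = []                               # long-term memory of reflections
    for episode in range(max_episodes):
        output = actor(task, memory)          # act with reflections as context
        if evaluator(output):                 # evaluator judges the result
            return output, episode
        memory.append(reflect(task, output))  # on failure: store the analysis
    return None, max_episodes
```

The key design point is that the reflection is plain text stored across episodes, so the next attempt gets "why did this fail?" as extra context rather than starting from scratch.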

Note that the Memory has two tiers: