TLDR: The first half of agent research was an engineering race — longer reasoning chains, larger action spaces, deeper agentic workflows. It worked, but it is hitting diminishing returns. The second half is about turning agents into a science: asking not only "does it work?" but "why does it work, and when should it?" In this post I walk through one attempt to answer that — the Theory of Agent (ToA) — and show how it connects to what frontier labs are actually doing: long-context reasoning, self-evolving agents, tool-use RL, and the search for principled alignment.

Why Second Half?

Pre-training has its scaling laws. Alignment has RLHF. Reasoning has "test-time compute." What does agent research have?

Mostly benchmarks and heuristics. ReAct [Yao et al., 2023] tells us to interleave reasoning and acting. Tool-integrated RL — Search-R1 [Jin et al., 2025], ToRA [Gou et al., 2024], DeepSeek-R1 [DeepSeek-AI, 2025] — tells us to optimize task success. Workflow frameworks (AutoGen, CrewAI, LangGraph) tell us to compose agents. Each works but few of them answers the question the agent actually faces at every step of execution:

Should I continue reasoning internally, or should I reach out to the external world to acquire more information?

If you look carefully at the failure modes dominating agent research over the past two years, almost all of them are different ways of getting this one question wrong:

Failure mode	What's happening
Underthinking [Wang et al., 2025, NeurIPS]	Abandons reasoning too early, keeps switching approaches
Overthinking [Chen et al., 2025, ICML; Cuadron et al., 2025]	Keeps reasoning when a tool call would resolve the uncertainty instantly
Underacting	Misses the tool call that would unlock the task
Overacting [OSWorld-Human, ICML WCUA 2025]	Takes 1.4–2.7× more steps than necessary, even on correct trajectories

These are not four different bugs. They are one structural problem — miscalibrated decisions under epistemic uncertainty — showing up in four directions. The first half of agent work treated each symptom with its own patch. The second half has to treat the disease.

A Scene You've Definitely Seen Before: Two Perfect Scores on the Same Test

Imagine the same test handed to two students.

Student A takes it as a closed-book exam. He works through every problem on his own: recalling concepts, reasoning it out, double-checking his math, mentally reorganizing the material when he needs to. He treats the test as an opportunity to strengthen his fundamentals and sharpen his thinking.
Student B takes it as an open-book exam. For every question, he Googles it, asks ChatGPT, flips through the answer key — and copies the answer straight onto his paper.

When the tests are graded, both get 100. If the teacher only looks at the score, these two students are identical. But anyone who has taught — or simply been a student — knows that a semester later, the gap between them will be enormous:

Student A: Even when he got things wrong or took the long way around, every problem he worked through thickened that muscle we call "problem-solving intuition." By the time finals roll around, he solves the same kind of problem faster and more reliably — and he can generalize it to new ones.
Student B: He also "did" a semester's worth of problems, but the knowledge inventory in his head hasn't changed at all. The moment he faces a truly closed-book exam — or any situation where ChatGPT isn't available — he'll suddenly discover he knows nothing.

Two perfect scores. Two completely opposite growth trajectories.

<aside> ⚠️

A common misreading to clear up first: This story is not saying "Student A doesn't know how to use search engines" or "using tools is bad." Quite the opposite — A can and should use tools when he actually needs to. (The exam is a metaphor; in the real world, an Agent will absolutely hit problems it cannot answer on its own no matter what, and in those cases it must call on external information.)

</aside>