TLDR: The first half of agent research was an engineering race — longer reasoning chains, larger action spaces, deeper agentic workflows. It worked, but it is hitting diminishing returns. The second half is about turning agents into a science: asking not only "does it work?" but "why does it work, and when should it?" In this post I walk through one attempt to answer that — the Theory of Agent (ToA) — and show how it connects to what frontier labs are actually doing: long-context reasoning, self-evolving agents, tool-use RL, and the search for principled alignment.
Pre-training has its scaling laws. Alignment has RLHF. Reasoning has "test-time compute." What does agent research have?
Mostly benchmarks and heuristics. ReAct [Yao et al., 2023] tells us to interleave reasoning and acting. Tool-integrated RL — Search-R1 [Jin et al., 2025], ToRA [Gou et al., 2024], DeepSeek-R1 [DeepSeek-AI, 2025] — tells us to optimize task success. Workflow frameworks (AutoGen, CrewAI, LangGraph) tell us to compose agents. Each works, but none of them answers the question the agent actually faces at every step of execution:
> Should I continue reasoning internally, or should I reach out to the external world to acquire more information?
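To make the decision concrete, here is a deliberately naive sketch of what an explicit think-vs-act policy could look like. To be clear, this is not taken from any published system: `uncertainty`, `expected_tool_gain`, `tool_cost`, and `act_threshold` are all hypothetical quantities that real agents only ever approximate implicitly.

```python
from dataclasses import dataclass

@dataclass
class StepDecision:
    action: str      # "reason" (keep thinking) or "act" (call a tool)
    rationale: str

def decide_next_step(
    uncertainty: float,          # agent's self-estimated uncertainty, in [0, 1]
    expected_tool_gain: float,   # hypothetical: uncertainty a tool call would resolve
    tool_cost: float = 0.1,      # hypothetical: step/latency budget consumed by acting
    act_threshold: float = 0.5,  # calibration knob: when acting beats reasoning
) -> StepDecision:
    """Keep reasoning internally unless a tool call is expected to resolve
    more uncertainty than it costs. A toy sketch, not a real policy."""
    if uncertainty >= act_threshold and expected_tool_gain > tool_cost:
        return StepDecision("act", "external evidence settles this faster")
    return StepDecision("reason", "internal reasoning is still the better buy")
```

In this framing, every failure mode in the table below is a bad setting, or a bad estimate, of one of those four numbers.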
If you look carefully at the failure modes dominating agent research over the past two years, almost all of them are different ways of getting this one question wrong:
| Failure mode | What's happening |
|---|---|
| Underthinking [Wang et al., 2025, NeurIPS] | Abandons reasoning too early, keeps switching approaches |
| Overthinking [Chen et al., 2025, ICML; Cuadron et al., 2025] | Keeps reasoning when a tool call would resolve the uncertainty instantly |
| Underacting | Misses the tool call that would unlock the task |
| Overacting [OSWorld-Human, ICML WCUA 2025] | Takes 1.4–2.7× more steps than necessary, even on correct trajectories |
These are not four different bugs. They are one structural problem — miscalibrated decisions under epistemic uncertainty — showing up in four directions. The first half of agent work treated each symptom with its own patch. The second half has to treat the disease.
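One way to see that these are one problem rather than four: treat the agent's per-step choice as a binary classification against an oracle that knows whether acting or reasoning was actually the right move. The 2×2 below is my own toy framing, not a formalism from any of the cited papers; the two off-diagonal cells cover all four named failure modes.

```python
def classify_step(should_act: bool, did_act: bool) -> str:
    """Toy taxonomy: the four failure modes are the two off-diagonal
    cells of one miscalibrated decision boundary."""
    if should_act == did_act:
        return "calibrated"
    if should_act and not did_act:
        # kept reasoning past the point a tool call would have settled
        return "overthinking / underacting"
    # reached for the world (or a new approach) before reasoning paid off
    return "underthinking / overacting"

for should_act in (False, True):
    for did_act in (False, True):
        print(f"oracle={should_act!s:5} agent={did_act!s:5} -> "
              f"{classify_step(should_act, did_act)}")
```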
Imagine the same test handed to two students. Student A looks up every answer externally: a quick search, copy the result, move on. Student B reasons through every problem from first principles and never checks a single fact.

When the tests are graded, both get 100. If the teacher only looks at the score, these two students are identical. But anyone who has taught, or simply been a student, knows that a semester later the gap between them will be enormous:
Two perfect scores. Two completely opposite growth trajectories.
<aside> ⚠️
A common misreading to clear up first: this story is not saying "Student A doesn't know how to use search engines" or "using tools is bad." Quite the opposite: A can and should use tools when he actually needs them. (The exam is a metaphor; in the real world, an agent will inevitably hit problems it cannot answer on its own, and in those cases it must call on external information.)
</aside>