We taught our agent to plan & route, beating Genspark on GAIA.

Leni hit 77.6% on GAIA validation. 128 out of 165 tasks. above Genspark, above Manus, above OpenAI Deep Research. the three most-discussed agentic systems of the past year. and we lead on every difficulty tier, including the hardest one.

we didn't build a custom browser stack. we didn't fine-tune for it. we didn't run a proprietary tool layer.

we taught one agent to plan, another to execute, and a router to pick the best model for each step from across providers. that's it. that's the whole thing.

here's what we learned.

the problem is not tool use. it's orchestration.

frontier models can use tools. Claude Opus 4.6 can browse a website, parse a spreadsheet, write Python, read an image. it's genuinely good at this.

bare Opus with tool access scores in the low 60s on GAIA validation.

that sounds high until you realize the top of the leaderboard is in the high 70s. and the gap isn't a reasoning gap. the model knew which tools to call. it just couldn't sequence them, recover from failures, and pick the right model for each step.

four failure modes dominate:

the planning-execution conflation. one model trying to plan a six-step trajectory while also executing each step has to context-switch on every turn. it loses sight of the overall goal. it rebuilds the plan implicitly each step, which means the plan drifts. step three quietly forgets what step one was looking for.

the model-task mismatch. a page-relevance classifier wants a fast cheap model. a multi-hop reasoning step wants a frontier reasoner with extended thinking. a vision step wants strong image grounding. forcing one model to do all of this means accepting the worst-case profile on every step. paying frontier prices for trivial classification and getting trivial-model accuracy on hard reasoning.

the cascade problem. errors don't stay local. a misread cell on step two becomes a wrong filter on step three becomes a wrong final answer on step six. the model has no checkpoint to ask "wait, does this intermediate result make sense before I commit it." by the time the answer is wrong, the chain has buried the cause four turns deep.

the tool surface problem. GAIA pulls from web pages, YouTube transcripts, PDFs, Excel files, images, audio, code. each tool has its own quirks. JavaScript-heavy pages defeat naive scrapers. scanned PDFs need OCR. multi-sheet workbooks hide answers on the third tab. a monolithic agent treats every tool the same. tasks fail not because the model couldn't reason about the answer, but because the tool layer never delivered usable input.

every team building agentic systems lives in this gap. the demos work. the long chains don't.

we solved it with three layers

not a better agent loop. a compound system.

layer 1: the planner

a frontier model knows how to call tools. it doesn't know how to hold a six-step plan in working memory while it's executing each step.