77.6% on GAIA Validation | Leading All Three Difficulty Levels | Above Genspark, Manus, and OpenAI Deep Research
Leni Agent · April 2026
Large language models can use tools. They can call a browser, run code, read a PDF, transcribe audio. What they cannot do reliably, on tasks that require six tool calls in the right order, is orchestrate.
GAIA is the benchmark that exposes this. Each task asks a real-world question (find this fact in a YouTube video, cross-reference it with this PDF, compute the result) that no single tool can answer. The model has to plan a sequence, execute each step, recover from the inevitable failures, and integrate the results. A frontier model with raw tool access can write each individual step. It will still fail on the third or fourth turn, when a stale URL, a malformed PDF, or a misread cell silently corrupts the rest of the chain.
Leni is an AI Business Analyst. On the GAIA validation set [1], the agentic-AI gold standard from Meta and Hugging Face, Leni achieves 77.6% (128/165), with 88.6% on Level 1, 75.5% on Level 2, and 61.5% on Level 3. This places Leni above Genspark (75.4%), Manus (73.4%), and OpenAI Deep Research (67.4%), three of the most-discussed agentic systems of the past year, with a leading score on every difficulty tier including the hardest long-horizon trajectories.
We did this without a custom toolchain, without specialized fine-tuning, and without a proprietary browser stack. Instead, we solved a different problem: we made the agent route the work to the right model and the right tool at the right step. This paper describes the architecture behind that result, a planner-executor split with adaptive cross-provider model selection that closes what we call the orchestration gap.
Every team building agentic systems hits the same ceiling. Single-step tool use works. Two-step chains usually work. By step five, things are falling apart, not because any single tool failed, but because the chain itself was never coherent.
The core issue is not capability. Frontier models like Claude Opus 4.6 can browse a website, parse a spreadsheet, write Python, and reason about images. The core issue is orchestration: deciding which tool to call, in what order, with what intermediate state, and which model is best suited to which sub-task.
This creates four failure modes that no amount of tool engineering alone can fix:
The Planning-Execution Conflation. A single model trying to plan a multi-step trajectory while executing each step has to context-switch on every turn. It loses sight of the overall goal. It rebuilds the plan implicitly on each step, which means the plan drifts. Tasks that require holding a five-step trajectory in mind become tasks where step three quietly forgets what step one was looking for.
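One way to picture the alternative is a plan that is produced once and then held fixed while a separate executor works through it. The sketch below is illustrative only and is not Leni's implementation: `Step`, `Trajectory`, `make_plan`, `run`, and the two stub functions are hypothetical names, and the stubs stand in for what would be model calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    index: int
    tool: str
    instruction: str


@dataclass
class Trajectory:
    goal: str
    plan: Tuple[Step, ...]  # immutable once planning is done
    results: Dict[int, str] = field(default_factory=dict)


def make_plan(goal: str, planner: Callable[[str], List[Tuple[str, str]]]) -> Trajectory:
    """Call the planner exactly once and freeze its output as a tuple of steps."""
    steps = tuple(
        Step(i, tool, instruction)
        for i, (tool, instruction) in enumerate(planner(goal), start=1)
    )
    return Trajectory(goal=goal, plan=steps)


def run(trajectory: Trajectory,
        executor: Callable[[str, Step, Dict[int, str]], str]) -> Trajectory:
    """Execute each step against the fixed plan, always passing the original goal."""
    for step in trajectory.plan:
        prior = dict(trajectory.results)  # copy, so the executor cannot rewrite earlier results
        trajectory.results[step.index] = executor(trajectory.goal, step, prior)
    return trajectory


# Hypothetical stand-ins for model calls (assumptions, for illustration only).
def stub_planner(goal: str) -> List[Tuple[str, str]]:
    return [
        ("browser", "locate the video"),
        ("transcriber", "extract the quoted figure"),
        ("python", "compute the result"),
    ]


def stub_executor(goal: str, step: Step, prior: Dict[int, str]) -> str:
    return f"{step.tool} ran with goal={goal!r} and prior steps {sorted(prior)}"


done = run(make_plan("answer the question", stub_planner), stub_executor)
```

Because the executor receives the goal and the earlier results on every call but has no way to edit the plan, step three cannot quietly redefine what step one was looking for.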
The Model-Task Mismatch. Different sub-tasks reward different model strengths. A page-classification call wants a fast, cheap model with low latency. A multi-hop reasoning step over scraped content wants a frontier reasoner with extended thinking. A vision step wants a model with strong image grounding. Forcing one model to do all of these means accepting the worst-case profile on every step: paying frontier prices for trivial classifications and getting trivial-model accuracy on hard reasoning.
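The mismatch described above can be made concrete with a small routing table that maps each kind of sub-task to a model profile. This is a minimal sketch under stated assumptions: the task kinds mirror the three examples in the paragraph, while the model names (`fast-small-model`, `frontier-reasoner`, `vision-grounded-model`) are placeholders and do not refer to any real provider or to Leni's actual routing policy.

```python
from dataclasses import dataclass
from enum import Enum


class TaskKind(Enum):
    CLASSIFY = "classify"   # e.g. page classification
    REASON = "reason"       # e.g. multi-hop reasoning over scraped content
    VISION = "vision"       # e.g. reading a chart or screenshot


@dataclass(frozen=True)
class ModelProfile:
    name: str                # placeholder identifier, not a real model name
    extended_thinking: bool  # whether to request a longer reasoning budget


# Hypothetical routing table: one profile per kind of sub-task.
ROUTES = {
    TaskKind.CLASSIFY: ModelProfile("fast-small-model", extended_thinking=False),
    TaskKind.REASON: ModelProfile("frontier-reasoner", extended_thinking=True),
    TaskKind.VISION: ModelProfile("vision-grounded-model", extended_thinking=False),
}

# Assumed fallback: when a sub-task has no route, prefer accuracy over cost.
DEFAULT = ROUTES[TaskKind.REASON]


def route(kind: TaskKind) -> ModelProfile:
    """Return the model profile for a sub-task kind, falling back to the reasoner."""
    return ROUTES.get(kind, DEFAULT)
```

A single-model agent is the degenerate case of this table, with every key pointing at the same profile.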
The Cascade Problem. Errors do not stay local. A misread cell on step two becomes a wrong filter on step three, which becomes a wrong final answer on step six. The model has no checkpoint to ask, "Wait, does this intermediate result make sense before I commit it to the next step?" By the time the answer is wrong, the chain has buried the cause four turns deep.
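The missing checkpoint can be sketched as a plausibility check that runs after every step and stops the chain at the first result that fails it. The code below is an assumption-laden illustration, not the mechanism Leni uses: `CheckpointError` and `run_with_checkpoints` are hypothetical names, and the checks are simple predicates where a real system might use a model-based verifier.

```python
from typing import Any, Callable, List, Tuple

# Each step is (name, function applied to the previous value, plausibility check).
CheckedStep = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


class CheckpointError(Exception):
    """Raised when an intermediate result fails its plausibility check."""

    def __init__(self, step_name: str, value: Any):
        super().__init__(f"step {step_name!r} produced an implausible result: {value!r}")
        self.step_name = step_name
        self.value = value


def run_with_checkpoints(steps: List[CheckedStep], initial: Any) -> Any:
    """Run steps in order, validating each result before it feeds the next step."""
    value = initial
    for name, fn, check in steps:
        value = fn(value)
        if not check(value):
            # Stop here so the failure is reported at its source, not four turns later.
            raise CheckpointError(name, value)
    return value


# Illustrative chain: a misread cell yields a negative revenue figure.
misread_chain: List[CheckedStep] = [
    ("read_cell", lambda _: -42.0, lambda v: v >= 0),
    ("apply_filter",
     lambda v: [x for x in (10.0, 50.0) if x > v],
     lambda rows: len(rows) > 0),
]

try:
    run_with_checkpoints(misread_chain, None)
except CheckpointError as err:
    print(err)  # the error names the step where the chain went wrong
```

Without the check, the filter step would accept the negative threshold and return a non-empty list, so nothing downstream would signal that the input was wrong.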
The Tool Surface Problem. GAIA tasks pull from heterogeneous data: web pages, YouTube transcripts, PDFs, Excel files, images, audio, and code. Each tool has its own quirks: JavaScript-heavy pages that defeat naive scrapers, scanned PDFs that need OCR, multi-sheet workbooks where the answer lives on the third tab. A monolithic agent treats every tool the same. Tasks fail not because the model couldn't reason about the answer, but because the tool layer never delivered usable input.
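As a rough illustration of treating tools differently by input type, the sketch below picks a handler from the shape of the source. Everything here is hypothetical: the tool names (`rendering_browser`, `pdf_reader_with_ocr_fallback`, and so on) are labels invented for this example rather than real libraries, and the extension table is an assumption, not a description of Leni's tool layer.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a tool label.
EXTENSION_TOOLS = {
    ".pdf": "pdf_reader_with_ocr_fallback",    # scanned PDFs may need OCR
    ".xlsx": "spreadsheet_reader_all_sheets",  # read every tab, not just the first
    ".xls": "spreadsheet_reader_all_sheets",
    ".png": "image_reader",
    ".jpg": "image_reader",
    ".mp3": "audio_transcriber",
    ".wav": "audio_transcriber",
    ".py": "code_runner",
}


def pick_tool(source: str) -> str:
    """Choose a tool label for a URL or a local file path."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        if host == "youtu.be" or host == "youtube.com" or host.endswith(".youtube.com"):
            return "youtube_transcript"
        suffix = PurePosixPath(parsed.path).suffix.lower()
        # Pages without a known file extension go to a JavaScript-capable browser.
        return EXTENSION_TOOLS.get(suffix, "rendering_browser")
    suffix = PurePosixPath(source).suffix.lower()
    if suffix not in EXTENSION_TOOLS:
        raise ValueError(f"no tool registered for {source!r}")
    return EXTENSION_TOOLS[suffix]
```

Raising on an unknown local file type, rather than guessing, keeps a tool-layer failure visible instead of handing the model unusable input.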
Bare frontier models, even Claude Opus 4.6 with tool use enabled, score in the high 50s to low 60s on GAIA validation. The 15-to-20-point gap between bare tool use and the top of the leaderboard is almost entirely attributable to these four failure modes.