91.25% on SpreadsheetBench Verified | 2nd Place Global | Zero Custom Engines
Leni Agent · April 2026
Large language models can reason about spreadsheets. They can write formulas, transform data, and follow complex instructions. But they are fundamentally blind to their own output. An LLM cannot evaluate the formula it just wrote. It cannot see that its SUMPRODUCT returns #NAME? in a different engine. It cannot observe that deleting row 14 silently invalidated the reference in row 30.
Leni is an AI Business Analyst. Its spreadsheet agent achieves 91.25% (365/400) on SpreadsheetBench Verified - the NeurIPS 2024 gold-standard benchmark - ranking second globally. We did this without building a custom spreadsheet engine, without neurosymbolic infrastructure, and without specialized training. Instead, we solved a harder problem: we gave the LLM a way to see.
This paper describes the three-layer architecture behind that result: an agentic scaffold that encodes domain expertise, a self-verification protocol that forces independent observation, and a closed-loop recalculation feedback system that gives the agent ground truth about its own formulas. Together, they transform a general-purpose frontier model into a production-grade spreadsheet agent.
Every team building LLM-powered spreadsheet tools hits the same wall. The demos look magical. The edge cases are brutal.
The core issue is not capability - frontier models like Claude Opus 4.6 can reason through complex spreadsheet logic. The core issue is observability. When an LLM writes a formula and saves a file, it has no idea what that formula actually evaluates to. It is writing into the void. The file is a black box.
This creates three failure modes that prompt optimization alone cannot solve:
The Engine Gap. Excel, LibreOffice, and openpyxl do not behave the same way. Excel and LibreOffice are separate calculation engines, while openpyxl reads and writes formulas without evaluating them. The engines support different function sets, handle array evaluation differently, and disagree on type coercion, date serialization, and error propagation. A formula that works perfectly in one environment can silently return zero, null, or a cryptic error string in another. The LLM has no way to know this happened.
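One cheap way to surface part of the engine gap before a file is ever opened is to list the functions a formula calls and compare them against what the target engine is known to support. The sketch below is illustrative only: `unsupported_functions` and `TARGET_ENGINE_FUNCTIONS` are hypothetical names, and the allowlist is a placeholder rather than any real engine's function list.

```python
import re

# Matches an identifier immediately followed by "(", e.g. SUMPRODUCT(
FUNC_CALL = re.compile(r"([A-Za-z_][A-Za-z0-9_.]*)\s*\(")

def unsupported_functions(formula, supported):
    """Return function names used in `formula` that are absent from `supported`."""
    # Blank out string literals so text like "SUM(" inside quotes is not counted.
    without_strings = re.sub(r'"(?:[^"]|"")*"', '""', formula)
    used = {name.upper() for name in FUNC_CALL.findall(without_strings)}
    return sorted(used - {name.upper() for name in supported})

# Placeholder allowlist for illustration only; NOT a real engine's function list.
TARGET_ENGINE_FUNCTIONS = {"SUM", "IF", "SUMPRODUCT", "VLOOKUP"}

missing = unsupported_functions(
    '=IF(A1>0, XLOOKUP(A1, B:B, C:C), "SUM(")',
    TARGET_ENGINE_FUNCTIONS,
)
print(missing)
```

A static check like this only catches missing functions. Differences in type coercion, array evaluation, or date handling still require actually recalculating the file in the target engine.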
The Serialization Trap. Programmatic spreadsheet editing introduces a class of silent corruption that never occurs in interactive use. Row mutations shift formula references. Cell merges create phantom ranges. The serialization layer can corrupt array formula syntax in ways that are syntactically valid but semantically wrong. The LLM wrote the correct logic; the file contains something different.
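To make the row-mutation problem concrete: an interactive spreadsheet rewrites references when a row is deleted, whereas a file-level library may leave formula text untouched (openpyxl's documentation, for example, notes that it does not manage formula dependencies when rows or columns are inserted or deleted). The following is a minimal sketch of the rewrite that would be needed, under simplifying assumptions: it handles single-cell A1 references only, and it ignores ranges, sheet-qualified names, and string literals. `rewrite_after_row_delete` is a hypothetical helper, not part of any library.

```python
import re

# A1-style reference: optional $, column letters, optional $, row number.
# The lookarounds avoid matching inside identifiers or function names such as LOG10(.
CELL_REF = re.compile(
    r"(?<![A-Za-z0-9_])(\$?[A-Z]{1,3})(\$?)(\d+)(?![A-Za-z0-9_(])"
)

def rewrite_after_row_delete(formula, deleted_row):
    """Rewrite single-cell A1 references as if `deleted_row` had been removed.

    Simplification: range endpoints are treated like single cells, which is
    not how a real engine adjusts ranges.
    """
    def shift(match):
        col, dollar, row = match.group(1), match.group(2), int(match.group(3))
        if row == deleted_row:
            return "#REF!"   # the referenced cell no longer exists
        if row > deleted_row:
            row -= 1         # rows below the deletion move up by one
        return f"{col}{dollar}{row}"
    return CELL_REF.sub(shift, formula)

before = "=A13+B$20*LOG10(C14)"
after = rewrite_after_row_delete(before, deleted_row=14)
print(after)
```

If the formula text stored in the file after a programmatic deletion differs from what a rewrite like this predicts, the file's references have silently drifted from the intended logic.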
The Determinism Illusion. LLMs are stochastic. The same prompt, the same file, the same task - different output. A formula that passed yesterday fails today because the model chose a slightly different function, a different cell reference pattern, a different computation order. You cannot prompt your way to determinism.
Bare Claude Opus 4.6 scores 80.25% on SpreadsheetBench Verified with a minimal prompt. The remaining 19.75% is where every production team lives - and it is almost entirely composed of these three failure modes.
We did not solve these problems by writing a better prompt. We solved them by building an agentic compound system - a layered architecture where the LLM is one component in a larger feedback loop. Each layer addresses a specific failure mode that the LLM cannot solve alone.
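The general shape of such a feedback loop can be sketched in a few lines. This is not Leni's implementation; it is a minimal illustration in which `propose` and `recalculate` are hypothetical callables standing in for the model and for a real spreadsheet engine, and the only signal fed back is the set of cells whose recalculated value is a standard error literal.

```python
# Classic spreadsheet error literals.
ERROR_LITERALS = {"#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A"}

def find_error_cells(computed):
    """Return sorted (address, error) pairs for cells holding an error literal."""
    return sorted(
        (addr, value)
        for addr, value in computed.items()
        if isinstance(value, str) and value.strip() in ERROR_LITERALS
    )

def run_with_feedback(propose, recalculate, max_attempts=3):
    """Propose formulas, recalculate, and retry with the observed errors.

    Returns (formulas, remaining_errors); an empty error list means the
    last attempt recalculated cleanly.
    """
    formulas, feedback = {}, []
    for _ in range(max_attempts):
        formulas = propose(feedback)        # model sees errors from the last try
        computed = recalculate(formulas)    # ground truth from an engine
        feedback = find_error_cells(computed)
        if not feedback:
            break
    return formulas, feedback

# Stand-ins for the model and the engine (made-up behavior for illustration).
def fake_propose(feedback):
    # First attempt uses a function the fake engine rejects; the retry avoids it.
    return {"B1": "=SUM(A1:A2)"} if feedback else {"B1": "=SUMX(A1:A2)"}

def fake_recalculate(formulas):
    return {addr: ("#NAME?" if "SUMX" in f else 3) for addr, f in formulas.items()}

formulas, remaining = run_with_feedback(fake_propose, fake_recalculate)
print(formulas, remaining)
```

The point of the design is that the retry is driven by observed values rather than by the model's belief about what it wrote.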
