we gave our LLM eyes. it hit 91.25%.

Leni hit 91.25% on SpreadsheetBench Verified - the NeurIPS 2024 benchmark that everyone uses to measure spreadsheet agents. 365 out of 400 tasks. second place globally.

we didn't build a custom spreadsheet engine. we didn't train a model. we didn't write a neurosymbolic verifier.

we gave the LLM a way to see what its formulas actually evaluate to. that's it. that's the whole thing.

here's what we learned.

[figure 1: architecture pipeline]


the problem is not capability. it's blindness.

Claude Opus 4.6 can reason through complex spreadsheet logic. it can write SUMPRODUCT formulas, handle multi-sheet lookups, transform data across hundreds of rows. it's genuinely good at this.

it scores 80.25% on SpreadsheetBench with a 3-line prompt.

that sounds high until you realize the last 20% is where everything breaks. and the failures aren't reasoning failures. the model knew the right answer. it just couldn't see that its output was wrong.

three failure modes dominate:

the engine gap. Excel and LibreOffice are not the same engine. a formula that works in Excel can silently return #NAME? in LibreOffice. the LLM has zero visibility into this. it wrote a correct formula. the file contains an error. nobody told it.
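to make the gap concrete, here is a minimal sketch of the kind of check that closes it: compare each formula with the value an engine actually computed, and report the cells that came back as errors. this is an illustration, not Leni's implementation. `find_broken_cells` and both input dicts are hypothetical names, and the sample data is made up. in practice the two dicts could come from loading a recalculated file twice with openpyxl, once normally for the formula text and once with `data_only=True` for the cached values.

```python
# Error literals that both Excel and LibreOffice Calc display in cells.
ERROR_VALUES = {"#NAME?", "#VALUE!", "#REF!", "#DIV/0!", "#N/A", "#NUM!", "#NULL!"}


def find_broken_cells(formulas, values):
    """Return one report line per formula cell whose evaluated value is an error.

    formulas: dict of cell address -> formula text as written, e.g. "=SUM(A1:A3)"
    values:   dict of cell address -> value the engine computed for that cell
    (both shapes are assumptions made for this sketch)
    """
    report = []
    for address, formula in sorted(formulas.items()):
        value = values.get(address)
        if value is None:
            # A formula with no computed value usually means nothing recalculated it.
            report.append(f"{address}: {formula} -> no cached value (never recalculated?)")
        elif isinstance(value, str) and value in ERROR_VALUES:
            report.append(f"{address}: {formula} -> {value}")
    return report


# Hypothetical data: suppose the target engine did not recognise the function in B3.
formulas = {"B2": "=SUM(A1:A3)", "B3": "=XLOOKUP(A1,C:C,D:D)"}
values = {"B2": 60, "B3": "#NAME?"}

print("\n".join(find_broken_cells(formulas, values)))
```

the report is plain text, so it can be handed straight back to the model as feedback on what it wrote.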

the serialization trap. programmatic spreadsheet editing through openpyxl introduces silent corruption. row deletions move cells but leave formula references pointing at the old rows. cell merges create phantom ranges. array formula syntax gets corrupted during serialization. the file looks fine. it isn't.
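the row-deletion case is easy to reproduce. openpyxl's documentation states that it does not manage dependencies such as formulae when rows or columns are inserted or deleted. the sketch below is a toy model of that behaviour in plain Python, not openpyxl itself: a sheet is a dict keyed by (row, column), and every function name here is hypothetical. it shows the naive delete next to a version that rewrites references.

```python
import re

# Matches simple relative A1-style references such as A5 or BC12.
# Toy scope: no "$" absolute refs, no sheet names, and it would also
# misread function names that look like cells (e.g. LOG10).
CELL_REF = re.compile(r"\b([A-Z]{1,3})([0-9]+)\b")


def delete_row_naive(cells, row):
    """Delete `row` and move lower rows up, leaving formula text untouched."""
    shifted = {}
    for (r, col), value in cells.items():
        if r == row:
            continue  # drop the deleted row
        shifted[(r - 1 if r > row else r, col)] = value
    return shifted


def shift_refs(formula, deleted_row):
    """Rewrite references so they follow cells that moved up by one row."""
    def fix(match):
        col, r = match.group(1), int(match.group(2))
        if r == deleted_row:
            return "#REF!"  # the referenced cell no longer exists
        return f"{col}{r - 1 if r > deleted_row else r}"
    return CELL_REF.sub(fix, formula)


def delete_row_adjusting(cells, row):
    """Delete `row` and also rewrite the references inside every formula."""
    moved = delete_row_naive(cells, row)
    return {
        key: shift_refs(v, row) if isinstance(v, str) and v.startswith("=") else v
        for key, v in moved.items()
    }


cells = {(1, "A"): 10, (2, "A"): 20, (3, "A"): 30, (3, "B"): "=A3*2"}

naive = delete_row_naive(cells, 2)
adjusted = delete_row_adjusting(cells, 2)
print(naive[(2, "B")])     # =A3*2  (A3 is now empty, so the result silently changes)
print(adjusted[(2, "B")])  # =A2*2  (still points at the cell holding 30)
```

the naive version raises no error anywhere, which is the point: the only way to notice is to look at what the formula evaluates to afterwards.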

the determinism illusion. same prompt, same file, same task - different output. you can't prompt your way to determinism. a formula that passed yesterday fails today because the model picked a slightly different function signature.

every production team building LLM spreadsheet tools lives in this gap. the demos work. the edge cases don't.


we solved it with three layers

not a better prompt. a compound system.

layer 1: the agentic scaffold