71.6% Normalized Score on DRACO | #1 Globally | Beating Perplexity, Gemini, and OpenAI Deep Research
Leni Agent · April 2026
Deep research AI has a delivery problem.
Every major lab has shipped a "deep research" agent. They can search the web, synthesize sources, and produce multi-page reports. On the retrieval side — finding relevant information and stating it accurately — the best systems have converged. Perplexity, Gemini, OpenAI, and Leni all achieve comparable factual accuracy on complex research tasks. The race to retrieve is largely won.
But retrieval is not research. Research is what happens after you find the information: structuring the argument, citing sources correctly, using precise terminology, formatting for the reader, and producing output that a professional would trust enough to forward to a colleague. This is the delivery layer — and it is where most deep research agents fall apart.
Leni is an AI Business Analyst — not a deep research product. It handles spreadsheets, documents, presentations, and real estate analytics. But its production harness includes two tools that, together, turn a general-purpose agent into a competitive research system: a web search agent that retrieves and synthesizes across multiple sources, and a research validator that judges the draft across quality criteria and can send the agent back for additional search rounds before the user ever sees the output.
On the DRACO Benchmark [1] — the production-grounded deep research evaluation developed by Perplexity AI and Harvard — this general-purpose business analyst achieves a 71.6% normalized score, ranking first globally. It outperforms Perplexity Deep Research (70.5%), Gemini Deep Research (59.0%), and OpenAI Deep Research (52.1%) — all purpose-built deep research systems.
The margin over Perplexity on factual accuracy is negligible (66.7% vs. 66.5%). The margin on presentation quality is 11.5 percentage points (94.0% vs. 82.5%). The margin on citation quality is 8.4 points (86.6% vs. 78.2%). Leni does not win by knowing more. It wins by delivering better.
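The margins quoted above are simple differences in percentage points. A minimal sketch that recomputes them from the scores stated in this paragraph; the dictionary layout and the function name `margins` are illustrative choices, not part of DRACO or of Leni:

```python
# Per-dimension scores (percent) exactly as quoted in the paragraph above.
# The data layout and helper name are assumptions made for illustration.
SCORES = {
    "Factual Accuracy": {"Leni": 66.7, "Perplexity": 66.5},
    "Citation Quality": {"Leni": 86.6, "Perplexity": 78.2},
    "Presentation Quality": {"Leni": 94.0, "Perplexity": 82.5},
}


def margins(scores):
    """Return the Leni-minus-Perplexity gap in percentage points per dimension."""
    return {
        dim: round(by_system["Leni"] - by_system["Perplexity"], 1)
        for dim, by_system in scores.items()
    }


if __name__ == "__main__":
    for dim, gap in margins(SCORES).items():
        print(f"{dim}: {gap:+.1f} points")
```

The rounding to one decimal place only removes floating-point noise; it does not change any of the reported figures.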
This paper describes why the delivery gap exists, how a general-purpose business analyst with the right tools closes it, and what the DRACO results tell us about where deep research is actually heading.
Every team building deep research agents discovers the same pattern. The first 80% of performance comes from better retrieval — more sources, better ranking, smarter synthesis. The last 20% comes from everything retrieval does not touch.
DRACO measures four dimensions: Factual Accuracy, Breadth and Depth of Analysis, Citation Quality, and Presentation Quality. The first two reward finding and synthesizing information. The last two reward delivering it. And the leaderboard tells a clear story: the systems on the board are closer together on retrieval than on delivery.
This is not a coincidence. It reflects three structural problems with retrieval-first architectures:
The Single-Pass Problem. Most deep research agents follow a linear pipeline: search → synthesize → output. The synthesis step produces a draft; the draft becomes the final answer. There is no quality gate between "I found the information" and "I delivered the research." If the draft has a citation error, a structural flaw, or a missing trade-off analysis, nobody catches it. The user is the first reviewer.
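The difference between a linear pipeline and one with a quality gate can be sketched in a few lines. This is a generic illustration, not Leni's implementation: `search`, `synthesize`, and `validate` are hypothetical callables supplied by the caller, and `Review` and `max_rounds` are assumed names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    """Hypothetical reviewer verdict: pass/fail plus follow-up queries."""
    passed: bool
    gaps: List[str] = field(default_factory=list)


def single_pass(question: str, search: Callable, synthesize: Callable) -> str:
    """Linear pipeline: search, synthesize, output. The first draft is final."""
    return synthesize(question, search(question))


def gated(question: str, search: Callable, synthesize: Callable,
          validate: Callable, max_rounds: int = 3) -> str:
    """Same pipeline with a quality gate between draft and delivery."""
    sources = list(search(question))
    draft = synthesize(question, sources)
    for _ in range(max_rounds):  # bounded so a strict reviewer cannot loop forever
        review = validate(question, draft, sources)
        if review.passed:
            break
        # The reviewer names what is missing; each gap triggers another search.
        for gap in review.gaps:
            sources.extend(search(gap))
        draft = synthesize(question, sources)
    # Returns the latest draft, even if it never passed within max_rounds.
    return draft
```

In `single_pass` the user is the first reviewer. In `gated` the draft is reviewed at least once before delivery, and the round limit is a design choice that trades extra latency for a bounded number of retries.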
The Presentation Afterthought. Retrieval-first architectures treat formatting as a cosmetic pass — something that happens to the content after it is generated. But professional research presentation is not cosmetic. It is structural. The choice to lead with a conclusion vs. build to it, to use a numbered taxonomy vs. flowing prose, to surface a citation inline vs. in a footnote — these are decisions that affect how the reader processes and trusts the content. When presentation is an afterthought, it shows.
The Citation Integrity Gap. Correctly attributing claims to sources sounds simple. In practice, it is one of the hardest problems in deep research. The agent must track which claim came from which source across a multi-page synthesis, ensure that paraphrased claims are attributed to the right document, avoid fabricating citations that "look right," and handle cases where multiple sources partially support a claim. Retrieval engines optimize for finding the right sources. They do not optimize for citing them correctly in the output.
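One way to make part of this checkable is to keep an explicit record of which sources each claim is attributed to. The sketch below is an assumption-laden illustration, not a description of any system named here: `Claim`, `citation_issues`, and the source-id scheme are hypothetical. It catches only two of the failure modes listed above, uncited claims and citations that point at documents that were never retrieved.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    """A statement in the draft plus the ids of the sources it is attributed to."""
    text: str
    source_ids: Tuple[str, ...]


def citation_issues(claims: List[Claim], retrieved: Dict[str, str]) -> List[str]:
    """Flag claims with no citation or with a citation to an unknown source id.

    `retrieved` maps source id -> document text for everything the agent
    actually fetched. An id missing from it is a candidate fabricated citation.
    """
    issues = []
    for claim in claims:
        if not claim.source_ids:
            issues.append(f"uncited: {claim.text}")
        for sid in claim.source_ids:
            if sid not in retrieved:
                issues.append(f"unknown source {sid}: {claim.text}")
    return issues
```

A check like this says nothing about whether the cited document actually supports the claim, or about claims only partially supported by several sources. Those cases need a separate comparison of the claim against the source text.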
The result: on DRACO, the top two systems, Leni and Perplexity, are separated by 0.2 points on Factual Accuracy, but by 11.5 points on Presentation Quality and 8.4 points on Citation Quality. The retrieval race has converged. The delivery race has barely begun.