we gave our business analyst a validator. it hit 71.6%. [hold until press release is live]

Leni hit 71.6% on the DRACO Benchmark — the production-grounded deep research evaluation developed by Perplexity AI and Harvard. first place globally above Perplexity, Gemini, and OpenAI's purpose-built deep research agents.

we didn't build a custom research engine. we didn't train a new model. we didn't write a specialized retrieval pipeline.

we gave the LLM a way to validate its own output before the user sees it. that's it. that's the whole thing.

here's what we learned.

fig2_toolchain_v4.png

the problem is not retrieval. it's delivery.

every major lab has shipped a "deep research" agent. they can search the web, synthesize sources, and produce multi-page reports. on the retrieval side — finding relevant information and stating it accurately — the best systems have converged.

they score nearly identically on factual accuracy.

that sounds great until you realize the last 20% is where everything breaks. and the failures aren't retrieval failures. the model found the right information. it just couldn't deliver it professionally.

three failure modes dominate:

the single-pass problem. most deep research agents follow a linear pipeline: search → synthesize → output. there is no quality gate between "I found the information" and "I delivered the research." if the draft has a citation error or a structural flaw, nobody catches it. the user is the first reviewer.

the presentation afterthought. retrieval-first architectures treat formatting as a cosmetic pass. but professional research presentation is structural. the choice to lead with a conclusion, use a numbered taxonomy, or surface a citation inline affects how the reader processes and trusts the content. when presentation is an afterthought, it shows.

the citation integrity gap. correctly attributing claims to sources is one of the hardest problems in deep research. the agent must track which claim came from which source across a multi-page synthesis. retrieval engines optimize for finding the right sources. they do not optimize for citing them correctly in the output.

every production team building deep research tools lives in this gap. the retrieval works. the delivery doesn't.

we solved it with two tools

not a better search engine. a compound system.

tool 1: the web search agent

the web search agent is a standard tool in Leni's harness. given a complex query, it decomposes the question, searches across multiple sources, evaluates credibility, and synthesizes findings.

what makes it effective is not that it was designed for deep research. it is that it was designed for iterative use — producing output that another tool can evaluate, critique, and send back with specific feedback.