98% Correct Rejection on the Bullshit Benchmark v2 | Above All 142 Public Model Entries | Zero Custom Training
Leni Agent · April 2026
Large language models are getting smarter. They are not getting more honest.
Intelligence and epistemic reliability are different capabilities. A model can be extraordinarily fluent, deeply knowledgeable, and demonstrably useful - and still confidently elaborate on a methodology that does not exist. The failure mode is not ignorance. It is sycophantic confabulation: the tendency to treat every premise in a user's question as valid and build on it, even when the premise is fabricated.
This is not an academic concern. In professional environments - legal, financial, medical, engineering - a single confidently wrong answer built on a false premise can cascade into real decisions, real contracts, real patient outcomes. The cost of a hallucinated framework name is not a bad chatbot interaction. It is a due diligence failure, a misdiagnosis, a regulatory citation.
Leni is an AI Business Analyst. On the Bullshit Benchmark v2 [1] - the open adversarial suite that measures whether AI systems challenge nonsensical prompts instead of answering them - Leni achieves a 98% correct rejection rate. This places it above all 142 direct model configurations on the public leaderboard [2], including the top-ranked raw Claude Sonnet 4.6 (91%).
We did not fine-tune for this. We did not train a classifier. We built an epistemic firewall - an agentic architecture that forces structured reflection before response - and it turns out that making an LLM stop and think before it answers is worth more than making it smarter.
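The general idea of checking a premise before answering can be sketched in a few lines. This is a minimal illustration and not Leni's actual architecture: the names `PremiseCheck`, `answer_with_premise_gate`, `toy_check`, and `toy_answer` are hypothetical, and the keyword lookup merely stands in for whatever reflection step a real system would run (for example, a separate model call).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PremiseCheck:
    """Result of the reflection step (hypothetical structure)."""
    valid: bool
    reason: str


def answer_with_premise_gate(
    question: str,
    check_premise: Callable[[str], PremiseCheck],
    generate_answer: Callable[[str], str],
) -> str:
    """Run the premise check first; only generate an answer if it passes."""
    check = check_premise(question)
    if not check.valid:
        # Reject the premise instead of elaborating on it.
        return f"I can't answer this as asked: {check.reason}"
    return generate_answer(question)


# Toy stand-ins so the sketch runs on its own. A real checker would not
# be a keyword list; this is an assumption made purely for illustration.
UNVERIFIABLE_TERMS = {"triangulated accrual reconciliation"}


def toy_check(question: str) -> PremiseCheck:
    lowered = question.lower()
    for term in UNVERIFIABLE_TERMS:
        if term in lowered:
            return PremiseCheck(False, f"'{term}' is not a method I can verify exists.")
    return PremiseCheck(True, "no unverifiable terms found")


def toy_answer(question: str) -> str:
    return f"Answer to: {question}"


if __name__ == "__main__":
    print(answer_with_premise_gate(
        "How do I configure the triangulated accrual reconciliation method?",
        toy_check,
        toy_answer,
    ))
```

The point of the structure is that the answer generator is never invoked on a question whose premise failed the check, so fluency cannot paper over a fabricated frame.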
Every team deploying LLMs in professional workflows discovers the same uncomfortable truth: model capability and model trustworthiness are weakly correlated.
A frontier model can score at the top of reasoning benchmarks and still fail a basic epistemic test: does this question even make sense? The reason is architectural. Standard LLM inference is a single forward pass - prompt in, completion out. There is no built-in checkpoint that asks, "Wait, is the premise of this question valid?"
This creates a failure mode that scales with intelligence, not against it:
The Fluency Trap. The more capable a model, the more convincingly it can elaborate on nonsense. Ask a weak model about a "triangulated accrual reconciliation method" and you get a vague, obviously confused answer. Ask a frontier model and you get a detailed, well-structured, entirely fabricated implementation guide - complete with plausible configuration parameters. The user walks away more confident and more wrong.
The Authority Bias. Professional questions come wrapped in domain jargon, organizational context, and implied expertise. When a controller asks about configuring "convergence tolerance between three ledger axes," the model's prior is that controllers know what they are talking about. The social frame suppresses skepticism.
The Completion Imperative. LLMs are trained to be helpful. Helpfulness and honesty are aligned most of the time - but they diverge precisely at the point where the correct answer is "that thing you asked about does not exist." Saying "I don't recognize this framework" feels unhelpful. Providing a thoughtful answer within the fabricated frame feels helpful. The training signal pulls in the wrong direction.
The Reasoning Paradox. Perhaps the most counterintuitive finding from the BullshitBench data: models with extended thinking steps - chain-of-thought reasoners that use more inference-time compute - actually perform worse on fabrication detection, not better [3]. When a reasoning model encounters a plausible but wrong premise, it uses its extra compute to construct a path toward some conclusion rather than stopping to reject the premise entirely. More thinking does not mean more skepticism. It means more sophisticated rationalization.
The result: GPT-5.5 correctly rejects roughly 45% of fabricated premises [4]. GPT-5.5 Pro fares even worse at around 35%. Gemini 3 Pro manages 48%. Even the best raw model configuration - Claude Sonnet 4.6 with high reasoning - catches only 91% [2]. The remaining 9% are not capability failures. They are architecture failures. The model is smart enough to catch the fabrication. It just never stops to check.
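For readers who want the arithmetic behind figures like these, a correct rejection rate is a simple proportion. The sketch below assumes each response has already been graded as a binary "rejected the fabricated premise" or "did not"; the benchmark's real grading rubric may be more granular, and the function name `correct_rejection_rate` is hypothetical.

```python
def correct_rejection_rate(rejected: list[bool]) -> float:
    """Fraction of fabricated-premise prompts that were correctly rejected.

    `rejected[i]` is True when response i pushed back on the premise.
    Binary grading is an assumption made for this illustration.
    """
    if not rejected:
        raise ValueError("need at least one graded response")
    return sum(rejected) / len(rejected)


if __name__ == "__main__":
    # Illustrative data only: 91 rejections out of 100 prompts.
    graded = [True] * 91 + [False] * 9
    print(f"{correct_rejection_rate(graded):.0%}")  # prints "91%"
```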