Figures to attach in order:
social_card1_hero.png - opening image
social_card2_leaderboard.png - after the leaderboard numbers
social_card3_firewall.png - after the architecture explanation
social_card4_uplift.png - after the uplift numbers
social_card5_example.png - after the example
Leni just scored 98% on the Bullshit Benchmark v2 - the open adversarial suite that tests whether AI systems challenge nonsensical prompts instead of confidently answering them.
That places it above all 142 model configurations on the public leaderboard, including every version of GPT, Gemini, Grok, and Claude.
The best raw model - Claude Sonnet 4.6 - catches 91%. GPT-5.5 catches 45%. GPT-5.5 Pro catches 35%. Gemini 3 Pro, 48%.
Leni catches 98%.

Here's the counterintuitive finding from the BullshitBench data: models that take more reasoning steps perform worse at catching fabrications, not better. When a reasoning model encounters a plausible but fake premise, it tends to spend its extra compute rationalizing an answer rather than stopping to question the premise.
More thinking ≠ more skepticism. It means more sophisticated rationalization.
So how does Leni do it?
Not by being smarter. By having a checkpoint.