Leni scores 98% on BullshitBench v2 - above all 142 public model entries [hold until press release is live]

Social Media Post (X / LinkedIn / Blog)

Figures to attach in order:

social_card1_hero.png - opening image
social_card2_leaderboard.png - after the leaderboard numbers
social_card3_firewall.png - after the architecture explanation
social_card4_uplift.png - after the uplift numbers
social_card5_example.png - after the example

Leni just scored 98% on the Bullshit Benchmark v2 - the open adversarial suite that tests whether AI systems challenge nonsensical prompts instead of confidently answering them.

That places it above all 142 model configurations on the public leaderboard, including every listed version of GPT, Gemini, Grok, and Claude.

The best raw model - Claude Sonnet 4.6 - catches 91% of the nonsensical prompts. GPT-5.5 catches 45%. GPT-5.5 Pro catches 35%. Gemini 3 Pro catches 48%.

Leni catches 98%.

Here's the counterintuitive finding from the BullshitBench data: models with more reasoning steps actually perform worse at catching fabrications, not better. When a reasoning model encounters a plausible but fake premise, it uses its extra compute to rationalize an answer rather than stopping to question the premise.

More thinking ≠ more skepticism. It means more sophisticated rationalization.

So how does Leni do it?

Not by being smarter. By having a checkpoint.