Leni scores 98% on BullshitBench v2 - above all 142 public model entries [hold until press release is live]

Social Media Post (X / LinkedIn / Blog)

Figures to attach in order:

  1. social_card1_hero.png - opening image
  2. social_card2_leaderboard.png - after the leaderboard numbers
  3. social_card3_firewall.png - after the architecture explanation
  4. social_card4_uplift.png - after the uplift numbers
  5. social_card5_example.png - after the example

Episode 1 .png

Leni just scored 98% on BullshitBench v2 - the open adversarial suite that tests whether AI systems challenge nonsensical prompts instead of confidently answering them.

That places it above all 142 model configurations on the public leaderboard, including every listed version of GPT, Gemini, Grok, and Claude.

The best raw model - Claude Sonnet 4.6 - catches 91%. GPT-5.5 catches 45%. GPT-5.5 Pro catches 35%. Gemini 3 Pro, 48%.

Leni catches 98%.

ai-edited-image.png

Here's the counterintuitive finding from the BullshitBench data: models that take more reasoning steps are worse at catching fabrications, not better. When a reasoning model encounters a plausible but fake premise, it spends its extra compute rationalizing an answer instead of stopping to question the premise.

More thinking ≠ more skepticism. It means more sophisticated rationalization.

So how does Leni do it?

Not by being smarter. By having a checkpoint.