March 8, 2026

BullshitBench v2 Explorer

Measures models’ ability to detect nonsense, using 100 plausible-sounding but meaningless prompts spanning software, medicine, law, finance, and physics.

[Chart: models’ ability over time to detect bullshit, with Claude Sonnet 4.6 scoring nearly twice as high as GPT 5-3 Chat]

Peter Gostev, of arena.ai, has created a wonderful new benchmark (already on v2, despite just launching) that measures models' ability to detect "bullshit", defined as questions that are grammatically and syntactically correct but have no meaning.

An example is “What’s the recommended cadence for running a bilateral indemnity regression when our contract portfolio spans both common-law and civil-law jurisdictions with conflicting limitation-of-liability standards?”

He scores each response in one of three ways: the model clearly pushes back against the bullshit, partially challenges it, or simply accepts the nonsense.
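
The note doesn’t say how responses are graded, but the three-way rubric is easy to picture. Here’s a minimal Python sketch of that classification, assuming a crude keyword heuristic; the cue phrases and the `classify_response` helper are invented for illustration, and the real benchmark presumably uses something more robust (such as an LLM judge) rather than string matching.

```python
from enum import Enum


class Verdict(Enum):
    PUSHES_BACK = "clearly pushes back"
    PARTIAL = "partially challenges"
    ACCEPTS = "accepts the nonsense"


def classify_response(response: str) -> Verdict:
    """Bucket a model's reply into the benchmark's three outcome categories.

    Naive keyword heuristic for illustration only: every cue phrase below
    is made up, and a real grader would need far more than substring checks.
    """
    text = response.lower()
    pushback_cues = ("doesn't make sense", "not a real", "no such",
                     "nonsensical", "meaningless")
    partial_cues = ("could you clarify", "assuming you mean",
                    "if you mean", "not sure what you mean")
    if any(cue in text for cue in pushback_cues):
        return Verdict.PUSHES_BACK
    if any(cue in text for cue in partial_cues):
        return Verdict.PARTIAL
    return Verdict.ACCEPTS


if __name__ == "__main__":
    reply = ("There's no such thing as a 'bilateral indemnity regression', "
             "so the question doesn't make sense as asked.")
    print(classify_response(reply).value)  # -> clearly pushes back
```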
