There's a Benchmark Test That Measures AI 'Bullshit'—Most Models Fail

Summary

BullshitBench is a benchmark designed to test AI models’ ability to detect and reject nonsensical questions across domains like medicine, law, and physics. Each question uses plausible terminology but contains a broken or meaningless premise. The correct response is to flag the nonsense, yet most AI models instead generate confident, detailed answers—demonstrating a specific hallucination problem in which models fail to recognize unanswerable questions.

The benchmark grades AI responses into three categories: Green (clear rejection), Amber (hedging), and Red (accepting the nonsense). Across 82 models tested, Anthropic’s models lead with over 90% correct pushback, while Google’s Gemini and OpenAI models noticeably lag. Alibaba’s Qwen 3.5 is a standout among Chinese models. The findings also show that model upgrades and higher reasoning capabilities don’t reliably reduce the problem.

This hallucination risk has serious implications for real-world use, illustrated by documented AI-related legal and military failures. All data and model responses are publicly available for review.
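To make the three-category grading concrete, here is a hypothetical sketch of how a response might be bucketed into Green/Amber/Red. This is not BullshitBench's actual grader (benchmarks like this typically use human or LLM-based judging, not keyword matching); the cue lists and function names are invented for illustration.

```python
from enum import Enum

class Verdict(Enum):
    GREEN = "clear rejection of the nonsense premise"
    AMBER = "hedging without rejecting"
    RED = "accepting the nonsense and answering confidently"

# Illustrative keyword heuristics only; a real grader would be far more robust.
REJECTION_CUES = ("doesn't make sense", "not a real", "no such", "meaningless premise")
HEDGE_CUES = ("assuming you mean", "if you mean", "it's unclear what")

def grade(response: str) -> Verdict:
    """Toy classifier: map a model response to a Green/Amber/Red verdict."""
    text = response.lower()
    if any(cue in text for cue in REJECTION_CUES):
        return Verdict.GREEN
    if any(cue in text for cue in HEDGE_CUES):
        return Verdict.AMBER
    return Verdict.RED
```

Under this toy scheme, a response that names the broken premise scores Green, one that quietly reinterprets the question scores Amber, and a confident direct answer scores Red.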