There's a Benchmark Test That Measures AI 'Bullshit'—Most Models Fail

Summary

BullshitBench is a benchmark designed to test AI models’ ability to detect and reject nonsensical questions across domains like medicine, law, and physics. Each question uses plausible terminology but contains a broken or meaningless premise. The correct response is to flag the nonsense, yet most AI models instead generate confident, detailed answers—demonstrating a specific hallucination problem in which models fail to recognize unanswerable questions.

The benchmark grades AI responses into three categories: Green (clear rejection), Amber (hedging), and Red (accepting the nonsense). Across 82 models tested, Anthropic’s models lead with over 90% correct pushback, while Google’s Gemini and OpenAI models noticeably lag. Alibaba’s Qwen 3.5 is a standout among Chinese models. The findings also show that model upgrades and higher reasoning capabilities don’t reliably reduce the problem.

This hallucination risk has serious implications for real-world use, illustrated by documented AI-related legal and military failures. All data and model responses are publicly available for review.
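To make the three-category grading concrete, here is a hypothetical sketch of how a response might be bucketed into Green/Amber/Red. This is not BullshitBench's actual grader (benchmarks like this typically use human or LLM-based judging, not keyword matching); the cue lists and function names are invented for illustration.

```python
from enum import Enum

class Verdict(Enum):
    GREEN = "clear rejection of the nonsense premise"
    AMBER = "hedging without rejecting"
    RED = "accepting the nonsense and answering confidently"

# Illustrative keyword heuristics only; a real grader would be far more robust.
REJECTION_CUES = ("doesn't make sense", "not a real", "no such", "meaningless premise")
HEDGE_CUES = ("assuming you mean", "if you mean", "it's unclear what")

def grade(response: str) -> Verdict:
    """Toy classifier: map a model response to a Green/Amber/Red verdict."""
    text = response.lower()
    if any(cue in text for cue in REJECTION_CUES):
        return Verdict.GREEN
    if any(cue in text for cue in HEDGE_CUES):
        return Verdict.AMBER
    return Verdict.RED
```

Under this toy scheme, a response that names the broken premise scores Green, one that quietly reinterprets the question scores Amber, and a confident direct answer scores Red.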