OpenAI Says Benchmark Used to Measure AI Coding Skill Is 'Contaminated'—Here's Why
SWE-bench Verified, a widely used benchmark for evaluating AI coding abilities, has been declared unreliable by OpenAI due to flawed test design and extensive training data leakage. Originally designed to measure how well AI models could fix real bugs in open-source Python projects, the benchmark became a key metric for model comparisons, with leading labs touting their high scores as proof of progress.

OpenAI found that 59.4% of the tasks it audited were broken: many required specific, undisclosed function names or tested irrelevant features, while training data contamination allowed models to recall solutions verbatim. Even top models from OpenAI, Anthropic, and Google had seen the answers during training.

As a result, OpenAI now recommends using SWE-bench Pro, a newer, less-contaminated benchmark on which model performance is dramatically lower (around 23%). OpenAI acknowledges that cycling through benchmarks is a recurring problem as models begin to memorize public test sets, and the company is shifting to private, expert-authored tasks to ensure more robust evaluation. This change undercuts recent leaderboard claims and highlights the challenge of fairly assessing progress in coding AI.