OpenAI Says Benchmark Used to Measure AI Coding Skill Is 'Contaminated'—Here's Why
SWE-bench Verified, a widely used benchmark for evaluating AI coding abilities, has been declared unreliable by OpenAI due to flawed test design and extensive training data leakage. Originally designed to measure how well AI models could fix real bugs in open-source Python projects, the benchmark became a key metric for model comparisons, with leading labs touting their high scores as proof of progress.

OpenAI found that 59.4% of the tasks it audited were broken: many required specific, undisclosed function names or tested irrelevant features, while training data contamination allowed models to recall solutions verbatim. Even top models from OpenAI, Anthropic, and Google had seen the answers during training.

As a result, OpenAI now recommends using SWE-bench Pro, a newer, less-contaminated benchmark on which model performance is dramatically lower (around 23%). OpenAI acknowledges that cycling through benchmarks is a recurring problem as models begin to memorize public test sets, and the company is shifting to private, expert-authored tasks to ensure more robust evaluation. This change undercuts recent leaderboard claims and highlights the challenge of fairly assessing progress in coding AI.