Forget AGI—Top AI Models Still Struggle With Math
Recent results from the MathVista benchmark show that leading AI models, including GPT-4 Vision, still trail humans on tasks that require mathematical reasoning over visual information, such as interpreting charts and diagrams. GPT-4 Vision posted the top machine score at 49.9%, while human participants averaged 60.3%.

The benchmark challenges models with multimodal problems that go beyond text pattern-matching, demanding multi-step reasoning over combined visual and mathematical data. Building the MathVista dataset required specialized annotators to ensure each problem called for genuine reasoning rather than recall, yielding more than 6,000 examples.

Researchers note persistent difficulties in measuring true capability, owing in part to potential data contamination and limited diversity in training data. Some suggest that simulated environments could help models push past the boundaries of their training knowledge. For now, human evaluators remain essential for assessing AI performance and for closing the gap between machine and human-level reasoning.

