Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail

Summary

AI personal assistants are marketed as tools that can manage every aspect of a user's digital life with minimal input. To test this claim, researchers developed the Claw-Anything benchmark, which evaluates AI agents on tasks that reflect real-world complexity: long-term event streams, workflows that involve an average of 10.1 interdependent backend services per task, and interactions across multiple devices and platforms (Linux CLI, Android GUI). Unlike traditional benchmarks built from short, focused tasks, Claw-Anything gives agents contexts averaging 191,700 words. The agents performed poorly: OpenAI’s GPT-5.5 completed only 34.5% of tasks on the first try (“pass@1”). Proactive assistance was harder still, with agents averaging just 6.7% on proactive tasks compared with 25.9% on reactive ones. The results indicate that current AI agents fall short at real-world assistant duties, and the benchmark identifies coordination across services and data sources as the major hurdle toward effective AI personal assistants. Fine-tuning improved performance, but significant challenges remain. Claw-Anything exposes limitations that standard benchmarks overlook and is available as open source for further research.
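For readers unfamiliar with the metric, pass@1 is the share of tasks an agent completes on its first attempt. The sketch below shows one way such a score could be tallied separately for reactive and proactive tasks. The function name `pass_at_1`, the record format, and the toy data are all hypothetical and are not taken from the Claw-Anything code or results.

```python
from collections import defaultdict


def pass_at_1(results):
    """Return the first-attempt success rate per task category.

    `results` is an iterable of (category, passed_first_try) pairs.
    This record format is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


# Hypothetical toy outcomes, NOT benchmark data.
toy_results = [
    ("reactive", True),
    ("reactive", False),
    ("reactive", False),
    ("reactive", False),
    ("proactive", True),
    ("proactive", False),
    ("proactive", False),
    ("proactive", False),
    ("proactive", False),
]

scores = pass_at_1(toy_results)
for category, score in scores.items():
    print(f"{category}: pass@1 = {score:.1%}")
```

Scoring the two categories separately is what makes a gap like the one reported between reactive and proactive tasks visible; a single pooled number would hide it.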