Updated · WIRED · May 26
AI Models Miss Accuracy Benchmarks, With Top Scores Still Below 56%

Summary

  • 55.6% was the best score on a recent Google-updated SimpleQA benchmark, where Gemini 2.5 Pro led and no tested model scored far above the halfway mark.
  • More than 60% of answers from AI-powered search engines were inaccurate in a March 2025 Tow Center study, while a BBC study put chatbot error rates near 45%, suggesting such tools are wrong roughly half the time.
  • 73% accuracy from Claude on RealFactBench was one of the stronger results cited, but performance varied by test and still fell short of making AI dependable for broad real-world fact-checking.
  • Tests by WIRED and fact-checking groups found models could outline how to verify claims but often failed to actually check them, reinforcing the need for humans to assess sources, context and nuance.
  • 60% of researchers surveyed for a 2025 AAAI report doubted AI's factuality problem would be solved soon, even as fact-checkers increasingly use the tools to surface leads for human verification.

Insights

As AI floods the web with convincing lies, can human fact-checkers win a war of attrition against machines?
If human oversight is today's fix, what redesign can make future AI a truly reliable source of truth?
With AI tools often wrong, what new skills must we learn to safely separate fact from digital fiction?

AI Benchmark Accuracy in 2026: Progress, Pitfalls, and the Growing Gap Between Metrics and Real-World Trust

Overview

Between 2025 and 2026, AI benchmark accuracy became a central focus, with new tools and tests helping policymakers, developers, and users better understand AI’s true capabilities and risks. Reports such as Stanford’s AI Index highlight these efforts, showing how benchmarks like SWE-bench Verified push AI systems to solve real-world coding problems. Notably, Claude Code achieved top performance on this test, outpacing GitHub Copilot. These advances reflect a broader trend: as benchmarks become more sophisticated, they give clearer insight into AI progress and help guide responsible development and practical deployment across industries.

...