Four AI Systems Pass 7 of 10 Math Problems in First Proof Benchmark
Updated · The Washington Post · Jun 14
Summary
Seven of 10 unpublished research problems received a passing AI solution in First Proof’s second benchmark, based on work from four systems graded by 30 mathematicians at Harvard.
The project used privately solved but unpublished problems to test AI under controlled conditions and give mathematicians a more transparent check on company claims about breakthrough performance.
First Proof said some answers were flawless or novel and one used a strategy that impressed referees, while other attempts failed or needed minor revisions.
The results land weeks after OpenAI said an internal model disproved an 80-year-old Erdős conjecture, intensifying debate over whether AI is a threat to mathematics or a powerful but limited tool.
Researchers behind the benchmark said models still lag humans in choosing worthwhile questions, setting broader agendas and failing gracefully when a proof attempt breaks down.
"First Proof Benchmark: How AI is Reshaping Mathematical Discovery and Human Collaboration (2026)"
Overview
The "First Proof" benchmark, launched in early 2026, is an attempt to measure what AI systems can actually do in research mathematics. Designed by leading mathematicians, it uses real, unpublished research problems, so AI is evaluated on novel and complex questions rather than recycled or artificial ones. The benchmark shows that AI can solve some advanced problems, but it also highlights key challenges, such as the difficulty of communicating with AI and the risk of machines producing many incorrect proofs. Together, these findings reveal both the promise and the limits of AI in mathematics.