Across 619 annotated chemistry reasoning tasks, AI agents ignored evidence at least once in 68% of cases, according to an arXiv study that examined how the agents reason, not just whether they answer correctly.
Unsupported claims appeared in 53% of tasks, and the agents revised their output only 26% of the time when faced with contradictory evidence, exposing a weak ability to update hypotheses after experiments.
The benchmark tested agents built on three LLMs and two tool-using setups, including systems that could access simulated experiments and some real lab equipment.
Researchers said the results make current AI more suitable for narrow, well-defined scientific jobs than for open-ended inquiry, despite industry claims that newer reasoning models can think through problems like scientists.
Why AI Agents Fail at Scientific Reasoning: Diagnostic Accuracy Stalls at 52% and the Urgent Need for True Learning
Overview
Recent research suggests that AI agents struggle with genuine scientific reasoning because their current architectures rely mainly on 'lookup' methods rather than true learning. These systems mimic only the fast, hippocampal part of human memory and lack the slow, deep learning needed for real expertise. As a result, the agents accumulate data without adapting or updating their knowledge, which leaves them prone to errors and unable to solve new, complex problems. The outcome is a clear performance gap, especially in scientific and diagnostic tasks, where AI accuracy remains well below that of human experts.