Updated · PsyPost · Jun 23

GPT-4o, Claude 3.5 Sonnet Collapse to 1%-10% on Stroop Test as Word Lists Reach 40

PNAS Nexus research found GPT-4o and Claude 3.5 Sonnet largely failed a Stroop-style attention test once prompts grew longer, abandoning the instruction to name ink colors and reverting to reading the words.
GPT-4o’s accuracy on incongruent color-naming fell from 91% on five-word lists to 1% on 20- and 40-word lists; Claude held up slightly longer but dropped to 10% on 40-word lists.
Five test conditions showed the breakdown was specific to conflict resolution: both models stayed strong on short lists and were nearly perfect on nonword “XXXX” trials, pointing to automatic word-reading as the interfering response.
The authors argue that transformer attention lacks human-like executive control (hard top-down inhibition that sustains goals under conflict), so scaling data or using code-based scaffolding may mask, not solve, the weakness.
That result challenges claims that larger language models alone are on a path to AGI and suggests future systems may need dedicated control architectures for long-horizon instruction following.

Sources

PsyPost · 3d ago

Advanced AI Models GPT-4o, Claude 3.5 Sonnet Collapse on Stroop Test with Increased Cognitive Demands

Executive Control Breakdown in LLMs: Stroop Task Accuracy Drops from 91% to 15% with Longer Lists

Overview

A study published in June 2026 reported that large language models (LLMs) struggle with executive control, as shown by their failure on the classic Stroop task. GPT-4o performed well on short lists, reaching 91% accuracy on five incongruent words, but its accuracy fell to 57% with ten words. The decline points to a limitation in maintaining focus and managing cognitive conflict under increased load, which matters for tasks that require sustained attention and executive function.

...
