ARC Prize open-sources analysis package on AI failure modes in ARC-AGI-3

7 articles · arcprize.org · May 1
  • The analysis reviewed 160 replays and reasoning traces, reporting that GPT-5.5 scored 0.43% and Opus 4.7 scored 0.18% on the semi-private dataset.
  • The package identifies three shared failure modes: seeing local effects without a global rule, forcing unfamiliar tasks into known game abstractions, and solving levels without learning transferable mechanics.
  • ARC-AGI-3 contains 135 hand-crafted novel environments, each solved by at least two humans, and has logged more than one million games to audit how well frontier AI reasoning generalises.
If top AIs score below 1% on reasoning tests, what fundamental piece of intelligence are they missing?
If an AI's self-explanation is unreliable, how can we ever truly trust its critical decisions?
Will building AI with a 'sense of consequence' fix the reasoning failures found in today's models?