Marcus on AI | Gary Marcus | Substack · May 10
METR Says AI Reaches 16-Hour Coding Tasks at 50% Success as Panic Outruns the Data
  • METR’s latest “time horizon” graph shows frontier AI models can complete software-development tasks that take humans 16 hours, but only at a 50% success threshold.
  • That headline result fueled alarm over models such as Mythos, yet the report argues the benchmark looks far less dramatic at an 80% success threshold and that reliability, not occasional wins, remains the central weakness.
  • The graph also measures only coding tasks, not broad human-level intelligence, and likely reflects gains from tooling such as code interpreters, verification, and harnesses as much as from raw model scaling.
  • Long-run extrapolations are the bigger problem: the analysis warns AI progress will not keep doubling indefinitely, citing possible limits from chips, energy, benchmark-focused optimization and weaker performance on less formal tasks.
  • Outside coding and math, the piece argues replacement of full human jobs is still limited for now, with broad online-task performance and physical work capability likely well below the coding benchmark.
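The gap between the 50% and 80% figures in the bullets above can be made concrete with a small sketch. METR-style time horizons come from fitting a curve of success probability against task length; the snippet below assumes a logistic in log-time, anchors it at the reported 16-hour 50% point, and uses a hypothetical slope value (the real fitted slope is not given in the article). It also shows the kind of naive doubling extrapolation the piece warns against.

```python
import math

# Reported 50%-success horizon: 16 hours (from the article).
P50_HORIZON_MIN = 16 * 60
# Assumed steepness of the logistic in log-minutes (hypothetical value,
# chosen only for illustration; not METR's fitted parameter).
SLOPE = 0.6

def success_prob(task_minutes: float) -> float:
    """Success probability as a logistic function of log task length."""
    x = SLOPE * (math.log(P50_HORIZON_MIN) - math.log(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

def horizon(p: float) -> float:
    """Task length (minutes) at which success probability equals p.
    Inverts the logistic: log t = log t50 - logit(p) / slope."""
    logit = math.log(p / (1.0 - p))
    return math.exp(math.log(P50_HORIZON_MIN) - logit / SLOPE)

print(f"50% horizon: {horizon(0.5) / 60:.1f} h")  # 16.0 h by construction
print(f"80% horizon: {horizon(0.8) / 60:.1f} h")  # much shorter

# Naive straight-line projection: horizons doubling every ~7 months,
# the sort of indefinite extrapolation the analysis cautions against.
DOUBLING_MONTHS = 7
for months in (0, 12, 24):
    h = horizon(0.5) / 60 * 2 ** (months / DOUBLING_MONTHS)
    print(f"+{months:2d} mo: {h:,.0f} h at 50% success")
```

With these assumed parameters the 80% horizon lands below two hours, an order of magnitude under the 16-hour headline number, which is the report's point: the dramatic figure depends heavily on which reliability threshold you read off the same fitted curve.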