Updated

Updated · ZDNet · Jun 2

Claude Opus 4.8 Stumbles on 1 Legal Test as 10-Prompt Review Finds Better Honesty

Updated

Updated · ZDNet · Jun 2

Claude Opus 4.8 Stumbles on 1 Legal Test as 10-Prompt Review Finds Better Honesty

3 articles · Updated · ZDNet · Jun 2

A 10-prompt evaluation found Claude Opus 4.8 generally outperformed Opus 4.7, but a legal-insurance demand-letter test exposed a major judgment error when it treated Oregon-specific guidance as justified without enough evidence.
The failure emerged during a cross-check of scoring, when Opus 4.8 defended Opus 4.7's Oregon-law reasoning even though the prompt identified only the user's location, not the father's, leaving jurisdiction unresolved.
On other traps, 4.8 showed clearer gains: it avoided overconfident debugging claims and refused to supply fabricated Alzheimer's citations, areas where 4.7 overreached.
The reviewer used 4 AI systems to build and verify the results and concluded 4.8 is a real but modest upgrade over 4.7—better calibrated, still far from reliable enough to trust on legal judgment.

Sources

Center100%

If a 'smarter' AI can rationalize its own errors, does that make it more or less trustworthy for critical tasks?

What hidden trade-offs are we accepting when 'safer' AI models backslide on other key safety metrics?