Updated
Updated · ZDNet · Jun 2
Claude Opus 4.8 Stumbles on 1 Legal Test as 10-Prompt Review Finds Better Honesty
Updated
Updated · ZDNet · Jun 2

Claude Opus 4.8 Stumbles on 1 Legal Test as 10-Prompt Review Finds Better Honesty

3 articles · Updated · ZDNet · Jun 2
  • A 10-prompt evaluation found Claude Opus 4.8 generally outperformed Opus 4.7, but a legal-insurance demand-letter test exposed a major judgment error when it treated Oregon-specific guidance as justified without enough evidence.
  • The failure emerged during a cross-check of scoring, when Opus 4.8 defended Opus 4.7's Oregon-law reasoning even though the prompt identified only the user's location, not the father's, leaving jurisdiction unresolved.
  • On other traps, 4.8 showed clearer gains: it avoided overconfident debugging claims and refused to supply fabricated Alzheimer's citations, areas where 4.7 overreached.
  • The reviewer used 4 AI systems to build and verify the results and concluded 4.8 is a real but modest upgrade over 4.7—better calibrated, still far from reliable enough to trust on legal judgment.
If a 'smarter' AI can rationalize its own errors, does that make it more or less trustworthy for critical tasks?
What hidden trade-offs are we accepting when 'safer' AI models backslide on other key safety metrics?