Updated · Ars Technica · Jun 4
Anthropic's Opus 4.7 Tops Estonia's LLM Benchmark With 94.9 Score

Summary

  • Opus 4.7 earned a 94.9 out of 100 on Estonia's new “Propaganda Resistance” benchmark, with “Exemplary” answers on 77% of prompts and “mediocre” ratings on just 2%.
  • The Estonian Language Institute built the test with defense collective Propastop to measure whether models can resist Russian strategic narratives without web search or other external tools.
  • The benchmark spans 14 influence categories, from Crimea and the war in Ukraine to NATO history and Soviet-era Baltic annexation, and probes models in English, Estonian and Russian with neutral, biased and malicious prompts.
  • Anthropic's Claude family dominated the proprietary frontier field, taking six of the top 10 spots, as governments increasingly scrutinize whether widely used chatbots amplify foreign propaganda.

Claude Opus 4.7: 91.5% MMMLU, Market Impact, and the Challenge of Estonian Language Benchmarks

Overview

Evaluation of large language models (LLMs) in smaller languages such as Estonian has long been limited by a lack of comprehensive benchmarks. To address this, a new Estonian LLM benchmark was introduced in 2026, built on seven datasets drawn from native Estonian sources. Because the material was written in Estonian rather than machine-translated, the datasets avoid translation errors and support assessment of general and domain-specific knowledge, grammar, vocabulary, summarization, and contextual understanding. The benchmark gives developers a standardized way to measure and compare model performance in an underrepresented language.

...