magazine.sebastianraschka.com · May 16
Four recent open-weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1-8B and DeepSeek V4—are converging on one goal: making long-context inference cheaper without simply shrinking model size.
128K-context savings are central to the review: Gemma 4's cross-layer KV sharing reuses key-value states across later layers, roughly halving the KV cache and saving about 2.7 GB on E2B and 6 GB on E4B.
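A minimal sketch of what cross-layer KV sharing looks like in code, assuming a layout where the second half of the layers read the key-value states cached by a paired earlier layer instead of computing their own; the layer pairing, shapes, and projections here are illustrative stand-ins, not Gemma 4's published configuration.

```python
import torch

# Sketch of cross-layer KV sharing (hypothetical pairing, not Gemma 4's wiring):
# later layers reuse the K/V states cached by an earlier "owner" layer, so only
# half the layers ever allocate cache entries.

n_layers, n_heads, head_dim, seq_len, batch = 8, 4, 64, 16, 1

# Hypothetical pairing: layer i in the second half reads KV from layer i - n_layers // 2.
share_from = {i: i - n_layers // 2 for i in range(n_layers // 2, n_layers)}

kv_cache = {}  # layer index -> (K, V), populated only for owner layers

def layer_kv(layer_idx, hidden):
    """Return (K, V) for a layer, computing and caching only for owner layers."""
    if layer_idx in share_from:
        return kv_cache[share_from[layer_idx]]       # reuse the earlier layer's KV
    # Stand-in projections; a real model would apply learned nn.Linear weights.
    k = hidden.view(batch, seq_len, n_heads, head_dim)
    v = hidden.view(batch, seq_len, n_heads, head_dim)
    kv_cache[layer_idx] = (k, v)
    return k, v

hidden = torch.randn(batch, seq_len, n_heads * head_dim)
for i in range(n_layers):
    k, v = layer_kv(i, hidden)

# Only half the layers hold cache entries, so KV memory is roughly halved.
print(f"layers with cached KV: {len(kv_cache)} of {n_layers}")
```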
40-layer Laguna XS.2 varies attention budgets by layer, while ZAYA1-8B runs attention directly in a compressed latent space with convolutional mixing to reduce both cache use and attention FLOPs.
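A rough sketch of the latent-space idea attributed to ZAYA1-8B, assuming attention runs on a down-projected representation while a depthwise convolution supplies local token mixing; the dimensions, projection layers, and the placement of the convolution are guesses for illustration, not the model's published design.

```python
import torch
import torch.nn as nn

class LatentConvAttention(nn.Module):
    """Sketch: attention in a compressed latent space plus depthwise conv mixing.

    Caching K/V at latent_dim rather than d_model shrinks the KV cache, and
    attention FLOPs scale with the smaller latent width. All details here are
    assumptions, not ZAYA1-8B's actual block.
    """

    def __init__(self, d_model=256, latent_dim=64, n_heads=4, kernel_size=3):
        super().__init__()
        self.down = nn.Linear(d_model, latent_dim)   # compress before attention
        self.up = nn.Linear(latent_dim, d_model)     # expand back afterwards
        self.attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        # Depthwise conv over the sequence for cheap local mixing.
        self.conv = nn.Conv1d(latent_dim, latent_dim, kernel_size,
                              padding=kernel_size // 2, groups=latent_dim)

    def forward(self, x):
        z = self.down(x)                                        # (B, T, latent_dim)
        z = z + self.conv(z.transpose(1, 2)).transpose(1, 2)    # local mixing
        out, _ = self.attn(z, z, z, need_weights=False)         # attention at latent width
        return x + self.up(out)                                 # residual at model width

x = torch.randn(2, 128, 256)
print(LatentConvAttention()(x).shape)   # torch.Size([2, 128, 256])
```

The per-layer attention budgets described for Laguna XS.2 are a separate lever: in that scheme, each layer gets its own context window or head count rather than a uniform one, which the latent-space trick above does not capture.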
1M-token efficiency is most dramatic in DeepSeek V4, where mHC widens residual pathways and CSA/HCA compress sequence history; the paper says V4-Pro uses 27% of V3.2’s inference FLOPs and 10% of its KV cache.
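To make the 10% KV-cache figure concrete, here is a back-of-envelope calculation at a 1M-token context; only the 27% (FLOPs) and 10% (cache) ratios come from the summary above, while the layer count, head dimensions, and dtype of the baseline are assumptions chosen purely for illustration.

```python
# Back-of-envelope: what "10% of the KV cache" means at a 1M-token context.
# Layer count, KV heads, head dim, and dtype below are assumptions; only the
# 10% cache and 27% FLOPs ratios come from the reported V4-Pro vs V3.2 comparison.

tokens     = 1_000_000
layers     = 60        # assumed
kv_heads   = 8         # assumed
head_dim   = 128       # assumed
bytes_elem = 2         # bf16

# Baseline cache: one K and one V vector per layer, per token.
baseline_gb   = tokens * layers * kv_heads * head_dim * 2 * bytes_elem / 1e9
compressed_gb = 0.10 * baseline_gb   # reported 10% of the baseline cache

print(f"baseline KV cache : {baseline_gb:,.1f} GB")   # ~245.8 GB under these assumptions
print(f"at 10% of baseline: {compressed_gb:,.1f} GB")  # ~24.6 GB
```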
The broader takeaway is that transformer blocks are becoming more specialized and far more complex, with efficiency gains increasingly coming from targeted attention and memory tweaks rather than wholesale architectural replacement.
New AIs can process a million tokens, but what are the hidden costs of their 'efficient' memory?
Is the AI industry just patching a flawed model, or is the transformer architecture built to last?
As open-source AI rivals proprietary giants, is the billion-dollar moat of closed-source models evaporating?