Updated · NVIDIA · Jun 15

World-Action Models Emerge as 2nd Robot Foundation Recipe as 14B Video Backbones Challenge VLAs

2 articles · Updated · NVIDIA · Jun 15

WAMs are moving from a niche idea to a mainstream robot-model paradigm, with the report arguing they now stand alongside VLM-based VLAs as the field’s second major foundation-model recipe.
14B-class and 5B-class video backbones such as Wan, plus newer world models like Cosmos, have made that shift practical by giving robot policies pretrained priors over scene dynamics rather than forcing language-to-action grounding from robot data alone.
DreamZero’s 1,750 RoboArena score versus Pi-0.5’s 1,622 is cited as an early real-world signal that WAMs can compete, while systems such as LingBot-VA, Cosmos Policy and Fast-WAM show the approach is branching into multiple formulations.
That progress comes with steep trade-offs: DreamZero-style action tuning is estimated at roughly 9 ZFLOPs and a full Wan-scale WAM stack at around 51 ZFLOPs, while common WAM inference modes take roughly 590-800 ms per action chunk versus about 190 ms for Pi-0.5.
The report’s bottom line is that WAMs are likely here to stay, but the eventual winner may be hybrid VLA-WAM systems that combine language understanding, world modeling and action generation rather than pure versions of either approach.
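
To put the quoted latency and benchmark figures side by side, here is a minimal arithmetic sketch. It uses only the numbers cited in the summary above; the variable and function names are illustrative, and the chunks-per-second figure assumes each action chunk is generated sequentially with no overlap between inference and execution, which the report summary does not state.

```python
# Figures as quoted in the summary; names here are illustrative only.
WAM_LATENCY_MS = (590.0, 800.0)  # common WAM inference modes, per action chunk
VLA_LATENCY_MS = 190.0           # Pi-0.5, per action chunk
DREAMZERO_SCORE = 1750.0         # RoboArena score cited for DreamZero
PI05_SCORE = 1622.0              # RoboArena score cited for Pi-0.5


def slowdown(wam_ms: float, vla_ms: float) -> float:
    """How many times longer a WAM takes per action chunk than the VLA."""
    return wam_ms / vla_ms


def chunks_per_second(latency_ms: float) -> float:
    """Action chunks per second, assuming strictly sequential inference."""
    return 1000.0 / latency_ms


low, high = (slowdown(ms, VLA_LATENCY_MS) for ms in WAM_LATENCY_MS)
score_gap = DREAMZERO_SCORE - PI05_SCORE

print(f"WAM slowdown vs Pi-0.5: {low:.1f}x to {high:.1f}x")
print(f"Pi-0.5 chunk rate: {chunks_per_second(VLA_LATENCY_MS):.2f}/s")
print(f"WAM chunk rate: {chunks_per_second(WAM_LATENCY_MS[1]):.2f}/s "
      f"to {chunks_per_second(WAM_LATENCY_MS[0]):.2f}/s")
print(f"RoboArena score gap: {score_gap:.0f} points")
```

Under these assumptions the quoted WAM modes are roughly three to four times slower per chunk than Pi-0.5, which is the trade-off the summary sets against the RoboArena score lead.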

Sources

NVIDIA · 1d ago

World-Action Models (WAMs) Emerge as Key Robot Foundation Model Paradigm, Challenging VLAs

theSun · 1d ago

Ace Robotics' kairos world model leads multiple global embodied-intelligence benchmarks

World-Action Models Overtake Vision-Language-Action: The 10x Leap in Robotics Foundation Models (2026 Report)

Overview

Robotics foundation models are shifting from Vision-Language-Action (VLA) models toward World-Action Models (WAMs). The shift is driven by a core limitation of VLAs: they map observations and instructions to actions without predicting how those actions will change the scene, so they do not model physical causality. That limitation puts a ceiling on Physical AI, because a robot that cannot anticipate the results of its actions cannot plan around them. WAMs address it by learning from large-scale video data, which lets robots anticipate and plan for future outcomes. The approach promises robots that adapt better to new situations.

...