Mastering Long-Horizon Planning: How GRASP Revolutionizes Gradient-Based World Model Optimization

The Promise and Pitfalls of Learned World Models

In recent years, large-scale learned world models have made remarkable strides. These models can predict sequences of future observations in high-dimensional visual spaces, generalize across diverse tasks, and increasingly resemble general-purpose simulators rather than task-specific predictors. However, possessing a powerful predictive model does not guarantee effective control or planning. When applied to long-horizon tasks, gradient-based planners often struggle with ill-conditioned optimization, poor local minima, and subtle failure modes in high-dimensional latent spaces.

(Image source: bair.berkeley.edu)

The Challenge of Long Horizons

Planning over extended time horizons is a true stress test for any gradient-based planner. The optimization landscape becomes increasingly complex: long-range dependencies create deceptive basins in which action sequences that look promising over the first few steps lead away from the goal. As the horizon lengthens, gradients can vanish or explode, and the planner may fail to discover coherent action sequences that lead to desired outcomes. Traditional sampling-based methods, such as random shooting or the cross-entropy method (CEM), may work for short horizons but become computationally infeasible or ineffective as the number of steps grows.
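To make the baseline concrete, a random-shooting planner can be sketched in a few lines. This is an illustrative toy, not code from the paper: the linear dynamics, quadratic cost, and all names are assumptions for demonstration.

```python
import torch

torch.manual_seed(0)

def random_shooting(dynamics, cost, s0, horizon, n_samples=256, action_dim=2):
    """Sample candidate action sequences, roll each out, keep the cheapest.

    The number of samples needed for good coverage grows rapidly with the
    horizon, which is why shooting methods degrade on long tasks.
    """
    actions = torch.randn(n_samples, horizon, action_dim)
    states = s0.expand(n_samples, -1)
    total = torch.zeros(n_samples)
    for t in range(horizon):
        states = dynamics(states, actions[:, t])
        total = total + cost(states)
    best = total.argmin()
    return actions[best], total[best].item()

# Toy linear dynamics and a quadratic cost toward the origin (illustrative).
dyn = lambda s, a: s + 0.1 * a
cst = lambda s: (s ** 2).sum(dim=-1)
best_plan, best_cost = random_shooting(dyn, cst, torch.ones(2), horizon=10)
```

Even in this tiny example, the planner only searches by brute-force sampling; nothing guides the candidates toward the goal.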

Introducing GRASP: A Robust Gradient-Based Planner

To address these challenges, we developed GRASP (Gradient-based planning for world models with virtual states, stochasticity, and pruned gradients). This new planner makes long-horizon planning practical through three key innovations that together dramatically improve robustness and efficiency.

Virtual Trajectory Lifting

GRASP lifts the trajectory into a set of virtual states, one per time step. This decouples the optimization across time, enabling parallel updates. Instead of sequentially rolling out the world model, the planner optimizes all time steps simultaneously. This not only speeds up computation but also mitigates the vanishing gradient problem by providing direct feedback to each state.
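The lifting idea can be sketched as a penalty formulation: treat every time step's state as a free optimization variable and softly tie it to the model's one-step prediction. This is a minimal illustration of the concept with a toy dynamics function; GRASP's exact objective and update rule may differ.

```python
import torch

torch.manual_seed(0)
H, state_dim, action_dim = 20, 4, 4
dyn = lambda s, a: s + 0.1 * torch.tanh(a)   # toy differentiable dynamics
s0, goal = torch.ones(state_dim), torch.zeros(state_dim)

# One free "virtual" state per time step, optimized jointly with the actions.
states = torch.zeros(H, state_dim, requires_grad=True)
actions = torch.zeros(H, action_dim, requires_grad=True)
opt = torch.optim.Adam([states, actions], lr=0.05)

losses = []
for _ in range(300):
    opt.zero_grad()
    prev = torch.cat([s0.unsqueeze(0), states[:-1]])
    # A soft consistency penalty ties each virtual state to the model's
    # one-step prediction, so every time step receives a direct gradient
    # instead of one relayed through a long backprop chain.
    consistency = ((states - dyn(prev, actions)) ** 2).sum()
    task_cost = ((states[-1] - goal) ** 2).sum()
    loss = task_cost + 10.0 * consistency
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because each time step appears directly in the loss, all of them can be updated in parallel within a single optimizer step.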

Stochastic State Iterations for Exploration

The second innovation introduces stochasticity directly into the state iterates during optimization. By adding controlled noise to the virtual states, GRASP encourages exploration of the trajectory space. This prevents the planner from getting trapped in poor local minima and allows it to discover better action sequences, especially in high-dimensional or multimodal dynamics.
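The effect of noisy iterates is easiest to see on a one-dimensional multimodal landscape. The sketch below is purely illustrative (the cost function, noise schedule, and learning rate are assumptions, not GRASP's actual settings): plain gradient descent settles in whichever basin it starts in, while annealed noise on the iterate lets the optimizer sample other basins before settling.

```python
import torch

torch.manual_seed(0)

# A 1-D multimodal landscape: a poor local minimum near x = +0.93 and a
# better one near x = -1 (purely illustrative).
cost = lambda x: (x ** 2 - 1) ** 2 + 0.5 * x

def optimize(x0, noise_scale):
    x = torch.tensor([x0], requires_grad=True)
    opt = torch.optim.SGD([x], lr=0.02)
    for k in range(500):
        opt.zero_grad()
        cost(x).backward()
        opt.step()
        # Perturb the iterate with annealed noise, then keep optimizing.
        x.data += noise_scale * (0.99 ** k) * torch.randn(1)
    return x.detach().item()

plain = optimize(1.0, noise_scale=0.0)   # stays in the nearby basin
noisy = optimize(1.0, noise_scale=0.5)   # noise lets it visit other basins
```

In GRASP the same principle is applied to the high-dimensional virtual states rather than a scalar, but the mechanism is the same: perturb, then continue descending.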

Gradient Reshaping for Clean Action Signals

Third, GRASP reshapes the gradients so that actions receive clean, informative signals while avoiding the brittle “state-input” gradients that often arise when gradients flow through high-dimensional vision models. By carefully decoupling the gradient flow, the planner can update actions effectively without being distorted by the visual feature hierarchy.
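One simple way to realize this decoupling is to block gradients through the state input of the dynamics model at each step, so that each action is updated from its immediate effect on the next state. The sketch below is a simplified stand-in for GRASP's gradient reshaping, not its exact rule; the dynamics function and names are illustrative.

```python
import torch

torch.manual_seed(0)
H, dim = 30, 4
dyn = lambda s, a: s + 0.1 * torch.tanh(a)   # toy differentiable dynamics

def rollout(s0, actions, cut_state_grads):
    """Roll out the model; optionally block gradients through the state input.

    With cut_state_grads=True, each action still receives a gradient from
    its direct effect on the next state, but the long backprop chain through
    earlier states (the brittle "state-input" path) is severed.
    """
    s, traj = s0, []
    for a in actions:
        s = dyn(s.detach() if cut_state_grads else s, a)
        traj.append(s)
    return torch.stack(traj)

actions = torch.zeros(H, dim, requires_grad=True)
traj = rollout(torch.ones(dim), actions, cut_state_grads=True)
(traj ** 2).sum().backward()   # a per-step cost so every action gets a signal
```

With a per-step cost, every action receives a clean, local gradient, whereas full backpropagation through time would route each signal through the entire visual feature hierarchy.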

Why Traditional Approaches Falter

Standard gradient-based planners typically suffer from several weaknesses. The optimization is often ill-conditioned because the loss landscape for long sequences is highly non-convex. Additionally, the need to backpropagate through many time steps of a learned dynamics model leads to exploding or vanishing gradients. In vision-based world models, the high-dimensional latent space introduces further fragility: small changes in latent states can produce large changes in predicted observations, making gradient signals noisy and unreliable. GRASP directly tackles these issues by restructuring the optimization and gradient paths.
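The vanishing-gradient failure mode is easy to reproduce: with a contractive toy dynamics model (Jacobian of 0.5 with respect to the state), the gradient reaching the first action shrinks geometrically with the horizon. This toy demonstration is ours, not from the paper.

```python
import torch

def first_action_grad_norm(horizon, shrink=0.5):
    """Gradient magnitude reaching the first action after backprop
    through `horizon` steps of a contractive toy dynamics model."""
    actions = torch.zeros(horizon, 1, requires_grad=True)
    s = torch.ones(1)
    for t in range(horizon):
        s = shrink * s + actions[t]      # state Jacobian is `shrink`
    s.sum().backward()
    return actions.grad[0].abs().item()

short = first_action_grad_norm(5)    # 0.5**4  = 0.0625
long = first_action_grad_norm(40)    # 0.5**39 ≈ 1.8e-12
```

With a Jacobian larger than one the same chain explodes instead; either way, early actions receive unusable updates.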

  • Ill-conditioned optimization: Virtual trajectory lifting improves conditioning by decoupling the update across time steps.
  • Vanishing gradients: The lifted formulation gives every time step a direct gradient signal, while stochastic state iterations keep exploration active and prevent the optimization from stalling.
  • Brittle state-input gradients: Gradient reshaping removes the dependency on unreliable visual-feature gradients, focusing updates on action-relevant information.

What Is a World Model? (A Working Definition)

The term “world model” has become overloaded. In our context, we define it as a learned model that, given recent states (images, latent vectors, proprioception) and an action, predicts the next state. Formally, it approximates the environment’s transition distribution
\(P_{\theta}(s_{t+1} \mid s_{t-h:t}, a_t)\),
where \(s_{t-h:t}\) is a short history of past states and \(a_t\) is the action at time \(t\); applied recursively, it predicts entire trajectories from a sequence of future actions.
This predictive model is the core of planning: by simulating many action sequences, the planner can select the one that best achieves a goal. GRASP assumes access to such a differentiable world model, which is becoming standard in modern reinforcement learning and robotics.
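A minimal version of this setup looks as follows. The model class and planner below are illustrative stand-ins under our own naming: a real world model would be a large network over images or latents, but only differentiability matters for the sketch, which performs plain gradient descent on an action sequence through the rolled-out model.

```python
import torch

torch.manual_seed(0)

class ToyWorldModel(torch.nn.Module):
    """Stand-in for a learned transition model s_{t+1} = f(s_t, a_t)."""
    def __init__(self, state_dim=4, action_dim=2):
        super().__init__()
        self.net = torch.nn.Linear(state_dim + action_dim, state_dim)

    def forward(self, s, a):
        # Residual update: predict the change to the current state.
        return s + self.net(torch.cat([s, a], dim=-1))

def plan(model, s0, goal, horizon=10, steps=100, lr=0.1):
    """Gradient-descend on an action sequence through the rolled-out model."""
    actions = torch.zeros(horizon, 2, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s = s0
        for t in range(horizon):
            s = model(s, actions[t])
        ((s - goal) ** 2).sum().backward()   # terminal cost only
        opt.step()
    return actions.detach()

model = ToyWorldModel()
s0, goal = torch.zeros(4), torch.ones(4)
planned = plan(model, s0, goal)
```

This naive planner is exactly the kind that suffers from the long-horizon pathologies described above; GRASP's three innovations restructure this optimization rather than replace it.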


Conclusion

GRASP represents a significant step toward making gradient-based planning with learned world models reliable for long-horizon tasks. By addressing the core issues of optimization conditioning, exploration, and gradient quality, our approach opens the door to using powerful world models in complex, real-world scenarios. Future work will extend these ideas to continuous action spaces and multi-task settings, further bridging the gap between simulation and reality.


This work was done in collaboration with Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar.
