LeWorldModel:

Stable End-to-End JEPA from Pixels

1Mila & Université de Montréal, 2New York University, 3Samsung SAIL, 4Brown University
*Equal Contribution

Abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48× faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

TL;DR: LeWM is a JEPA-based world model that avoids representation collapse using a simple Gaussian regularizer (SIGReg), trains end-to-end from pixels with only two loss terms, and achieves competitive control performance at a fraction of the compute cost.

Approach

Model Architecture. LeWM is built upon two components: an encoder and a predictor. The encoder maps a given frame observation \(\vo_t\) into a compact, low-dimensional latent representation \(\vz_t\). The predictor models the environment dynamics in latent space by predicting the embedding of the next frame observation \(\hat{\vz}_{t+1}\) given the latent embedding \(\vz_t\) and an action \(\va_t\).

$$\text{LeWorldModel} \left\{\begin{aligned} \text{Encoder:} \quad & \vz_t = \enc(\vo_t) \\ \text{Predictor:} \quad & \hat{\vz}_{t+1} = \pred(\vz_t, \va_t) \end{aligned}\right.$$
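The two components above can be made concrete with a minimal NumPy sketch. The 192-dim latent matches the token size used by LeWM; the 64×64×3 frame, the 2-D action, and the toy linear maps are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 192   # LeWM encodes each frame as a single 192-dim token
ACTION_DIM = 2     # assumption: a 2-D action, e.g. a planar pusher

def encoder(obs, W_enc):
    """Toy stand-in for Enc: flatten the frame and project to latent space."""
    return np.tanh(obs.reshape(-1) @ W_enc)            # z_t = Enc(o_t)

def predictor(z, action, W_pred):
    """Toy stand-in for Pred: next embedding from (z_t, a_t)."""
    return np.tanh(np.concatenate([z, action]) @ W_pred)  # ẑ_{t+1} = Pred(z_t, a_t)

# Random weights only make the shapes concrete; no training happens here.
obs = rng.normal(size=(64, 64, 3))                     # one RGB frame
W_enc = 0.01 * rng.normal(size=(64 * 64 * 3, LATENT_DIM))
W_pred = 0.1 * rng.normal(size=(LATENT_DIM + ACTION_DIM, LATENT_DIM))

z_t = encoder(obs, W_enc)
z_next_hat = predictor(z_t, np.zeros(ACTION_DIM), W_pred)
```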

Training Objective. The complete LeWM training objective combines a classical prediction loss \(\gL_{\rm pred}\) with a regularization term:

$$\gL_{\rm LeWM} \triangleq \gL_{\rm pred} + \lambda\,\mathrm{SIGReg}(\mZ)$$

The prediction loss \(\gL_{\rm pred}\) is a standard latent prediction loss. SIGReg is a regularizer enforcing a Gaussian distribution over the latent space; we refer the reader to LeJEPA for details.
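As a rough illustration of the two-term objective, the sketch below pairs an MSE prediction loss with a simplified moment-matching stand-in for SIGReg: random 1-D projections of the batch embeddings are pushed toward zero mean and unit variance, as an isotropic Gaussian target would require. The actual SIGReg statistic is defined in the LeJEPA paper; `lam` plays the role of the single tunable loss hyperparameter \(\lambda\).

```python
import numpy as np

def pred_loss(z_pred, z_next):
    """Standard latent prediction loss: MSE between predicted and true embeddings."""
    return np.mean((z_pred - z_next) ** 2)

def sigreg_standin(Z, num_proj=64, seed=0):
    """Simplified stand-in for SIGReg (not the paper's exact statistic):
    random 1-D projections of the batch Z (batch, dim) are pushed toward
    zero mean and unit variance, as an isotropic Gaussian would have."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(Z.shape[1], num_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = Z @ dirs                                   # (batch, num_proj) 1-D marginals
    mean_term = np.mean(proj.mean(axis=0) ** 2)
    var_term = np.mean((proj.var(axis=0) - 1.0) ** 2)
    return mean_term + var_term

rng = np.random.default_rng(1)
Z_next = rng.normal(size=(256, 192))                  # embeddings of the next frames
Z_pred = Z_next + 0.1 * rng.normal(size=(256, 192))   # predictor outputs

lam = 1.0                                             # the single loss hyperparameter
loss = pred_loss(Z_pred, Z_next) + lam * sigreg_standin(Z_next)

# A collapsed latent space (all embeddings identical) is heavily penalized:
collapsed = sigreg_standin(np.zeros((256, 192)))
```

Note how the regularizer alone rules out collapse: a degenerate batch scores far worse than a Gaussian-distributed one, which is what lets the prediction loss stay a plain MSE.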

Planning with LeWM

LeWM plans purely from pixels, with no proprioceptive information used at any stage. At test time, LeWM encodes a start and goal image into latent space, then uses the Cross-Entropy Method to optimize an action sequence by rolling out candidates through the predictor and picking those whose final embedding lands closest to the goal. Because each frame is encoded as a single 192-dim token (roughly 200× fewer tokens than DINO-WM), planning completes in about 1 second versus 47 seconds for DINO-WM, a 48× speedup. We perform ablations on several design choices of LeWM and find that it reaches similar performance while being orders of magnitude more efficient than DINO-WM.
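The planning loop just described can be sketched as follows. The goal-distance objective on final embeddings follows the description above; the CEM hyperparameters (population size, elite count, iterations) and the toy 2-D dynamics are illustrative assumptions.

```python
import numpy as np

def cem_plan(z_start, z_goal, predictor, horizon=5, action_dim=2,
             pop=64, elites=8, iters=10, seed=0):
    """Cross-Entropy Method in latent space: sample action sequences, roll them
    out through the predictor, keep the elites whose final embedding lands
    closest to the goal, and refit the sampling distribution."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        actions = mu + sigma * rng.normal(size=(pop, horizon, action_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z_start
            for t in range(horizon):
                z = predictor(z, actions[i, t])
            costs[i] = np.linalg.norm(z - z_goal)     # distance to goal embedding
        elite = actions[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu                                         # the optimized action sequence

# Toy dynamics: the "latent" is a 2-D position and actions translate it.
toy_pred = lambda z, a: z + 0.1 * a
plan = cem_plan(np.zeros(2), np.array([1.0, 1.0]), toy_pred)
```

Because the rollouts happen entirely in the compact latent space, the inner loop never touches pixels, which is where the planning-time advantage over token-heavy world models comes from.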


Efficient Planning

Planning performance at a fixed compute budget (FLOPs). LeWM achieves competitive results with a fraction of the computation required by baselines.

Planning time comparison
Push-T fixed FLOPs
OGBench Cube fixed FLOPs

Planning Results

Planning performance across four environments: Two-Room (2D navigation), Reacher (2-joint arm control), Push-T (block manipulation), and OGBench-Cube (3D robotic pick-and-place). LeWM outperforms PLDM on all challenging tasks and surpasses DINO-WM on Push-T and Reacher, even without pre-trained features. On Push-T, LeWM beats DINO-WM even when DINO-WM uses additional proprioceptive inputs. DINO-WM retains an edge on the visually complex 3D OGBench-Cube task, likely due to richer visual priors from large-scale pretraining. LeWM underperforms on Two-Room; we suspect this is due to the intrinsic dimensionality of the task being too low, which may hinder the Gaussian regularizer from producing a well-structured latent space.

Two-Room

Two-Room planning Two-Room planning plot

Reacher

Reacher planning Reacher planning plot

Push-T

Push-T planning Push-T planning plot

OGBench Cube

OGBench Cube planning OGBench Cube planning plot

Additional qualitative rollouts for each environment are shown below, including both success and failure cases. Each clip shows two frames side by side: left is the planning rollout and right is the visual goal.

Two-Room

Success

Two-Room success

Success

Two-Room success

Failure

Two-Room failure
Reacher

Success

Reacher success

Success

Reacher success

Failure

Reacher failure
Push-T

Success

Push-T success

Success

Push-T success

Failure

Push-T failure
OGBench Cube

Success

Cube success

Failure

Cube failure

Failure

Cube failure

Evaluating Physical Understanding


Physical Latent Probing

We evaluate which physical quantities are recoverable from LeWM's latent representations by training lightweight supervised probes (linear and MLP) to predict them from the frozen latent embeddings.
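A minimal version of such a probe, on synthetic data: the 192-dim embeddings and the linear ground-truth target are stand-ins (with real LeWM features, Z would come from the frozen encoder and y from logged physical state), and the linear probe is fit in closed form by least squares. The MSE and Pearson r mirror the metrics reported in the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: with real LeWM features, Z would be frozen encoder
# embeddings and y a logged physical quantity (e.g. block location).
Z = rng.normal(size=(1000, 192))
w_true = rng.normal(size=(192, 2))
y = Z @ w_true + 0.01 * rng.normal(size=(1000, 2))   # noisy (x, y) target

# Linear probe fit in closed form by least squares on the frozen embeddings.
W, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ W
mse = np.mean((y - y_hat) ** 2)
r = np.corrcoef(y.ravel(), y_hat.ravel())[0, 1]      # Pearson correlation
```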

Physical latent probing results on Push-T. LeWM consistently outperforms PLDM while remaining competitive with DINO-WM. The strong probing performance of DINO-WM on certain properties may stem from its foundation-model pretraining: the DINOv2 encoder is trained on two orders of magnitude more data (∼124M images) spanning a far more diverse distribution, which likely allows it to capture some physical properties in its embeddings by default.

| Model | Agent Location, Linear (MSE ↓ / r ↑) | Agent Location, MLP (MSE ↓ / r ↑) | Block Location, Linear (MSE ↓ / r ↑) | Block Location, MLP (MSE ↓ / r ↑) | Block Angle, Linear (MSE ↓ / r ↑) | Block Angle, MLP (MSE ↓ / r ↑) |
|---|---|---|---|---|---|---|
| DINO-WM | 1.888 / 0.977 | 0.003 / 0.999 | 0.006 / 0.997 | 0.002 / 0.999 | 0.050 / 0.979 | 0.009 / 0.995 |
| PLDM | 0.090 / 0.955 | 0.014 / 0.993 | 0.122 / 0.938 | 0.011 / 0.994 | 0.446 / 0.745 | 0.056 / 0.972 |
| LeWM | 0.052 / 0.974 | 0.004 / 0.998 | 0.029 / 0.986 | 0.001 / 0.999 | 0.187 / 0.902 | 0.021 / 0.990 |

Decoding Latent Space

To visualize the predictions made by LeWM, we train a lightweight decoder (used only for visualization, not during training) to reconstruct images from the CLS token embedding. For each environment, we report: (i) the original video, (ii) the video obtained by encoding and decoding each frame independently, and (iii) the video of latent predictions produced by the world model when conditioned on an action sequence. We can see that LeWM captures the most important parts of the scene and reproduces environment dynamics.
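The decoder's role can be illustrated with a toy closed-form version. The frame size, the random data, and the single linear layer are assumptions for brevity (the paper trains a small network on the CLS-token embeddings), but the key property is preserved: the decoder only reads frozen embeddings and sends no gradients back to the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 500 frames of 16x16 grayscale and their 192-dim embeddings.
Z = rng.normal(size=(500, 192))
frames = rng.normal(size=(500, 16 * 16))

# Lightweight decoder (here a single linear map, fit in closed form) that
# reconstructs pixels from frozen embeddings; used only for visualization,
# never during world-model training.
W_dec, *_ = np.linalg.lstsq(Z, frames, rcond=None)

def decode(z):
    return (z @ W_dec).reshape(16, 16)

# (ii) per-frame encode/decode applies decode to encoder outputs; (iii) the
# rollout videos instead decode the predictor's latent predictions.
recon = decode(Z[0])
```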

Two-Room decoding

Two-Room

Reacher decoding

Reacher

Push-T decoding

Push-T

OGBench Cube decoding

OGBench Cube


Visualizing Latent Space

A t-SNE visualization of the latent space in the Push-T environment suggests that the learned representation captures the spatial structure of the environment, preserving neighborhood relationships and relative positions.

t-SNE LeJEPA

LeWM


Violation-of-Expectation

To evaluate the physical understanding of our models, we follow the violation-of-expectation paradigm, where a world model should assign higher surprise to events that contradict learned physical regularities.
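One simple surprise proxy, assumed here for illustration (not necessarily the paper's exact score), is the world model's prediction error in latent space: encode the observed next frame and measure how far it lands from the predictor's output.

```python
import numpy as np

def surprise(z_pred, z_actual):
    """Illustrative surprise proxy: distance between the predicted next
    embedding and the embedding of the frame actually observed."""
    return float(np.linalg.norm(z_pred - z_actual))

# Toy illustration with 192-dim embeddings: a teleportation-style perturbation
# moves the observation far from where the dynamics predicted, so its surprise
# should exceed the unperturbed case.
z_pred = np.zeros(192)
z_unperturbed = 0.05 * np.ones(192)     # close to the model's prediction
z_teleported = 1.0 * np.ones(192)       # physically implausible jump
```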

Push-T

Unperturbed

Block color change

Teleportation

Surprise scores for Push-T

Surprise scores

Cube

Unperturbed

Cube color change

Teleportation

Surprise scores for Cube

Surprise scores

Citation (BibTeX)

@article{leworldmodel2026,
  author    = {Maes, Lucas and Le Lidec, Quentin and Scieur, Damien and LeCun, Yann and Balestriero, Randall},
  title     = {LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels},
  year      = {2026},
}