LeWorldModel:

Stable End-to-End JEPA from Pixels

1Mila & Université de Montréal, 2New York University, 3Samsung SAIL, 4Brown University
*Equal Contribution

Abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48× faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

TL;DR: LeWM is a JEPA-based world model that avoids representation collapse using a simple Gaussian regularizer (SIGReg), trains end-to-end from pixels with only two loss terms, and achieves competitive control performance at a fraction of the compute cost.

Approach

Model Architecture. LeWM is built upon two components: an encoder and a predictor. The encoder maps a given frame observation \(\vo_t\) into a compact, low-dimensional latent representation \(\vz_t\). The predictor models the environment dynamics in latent space by predicting the embedding of the next frame observation \(\hat{\vz}_{t+1}\) given the latent embedding \(\vz_t\) and an action \(\va_t\).

$$\text{LeWorldModel} \left\{\begin{aligned} \text{Encoder:} \quad & \vz_t = \enc(\vo_t) \\ \text{Predictor:} \quad & \hat{\vz}_{t+1} = \pred(\vz_t, \va_t) \end{aligned}\right.$$
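The two components above can be made concrete with a minimal NumPy sketch. The 192-dim latent matches the token size used by LeWM; the 64×64×3 frame, the 2-D action, and the toy linear maps are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 192   # LeWM encodes each frame as a single 192-dim token
ACTION_DIM = 2     # assumption: a 2-D action, e.g. a planar pusher

def encoder(obs, W_enc):
    """Toy stand-in for Enc: flatten the frame and project to latent space."""
    return np.tanh(obs.reshape(-1) @ W_enc)            # z_t = Enc(o_t)

def predictor(z, action, W_pred):
    """Toy stand-in for Pred: next embedding from (z_t, a_t)."""
    return np.tanh(np.concatenate([z, action]) @ W_pred)  # ẑ_{t+1} = Pred(z_t, a_t)

# Random weights only make the shapes concrete; no training happens here.
obs = rng.normal(size=(64, 64, 3))                     # one RGB frame
W_enc = 0.01 * rng.normal(size=(64 * 64 * 3, LATENT_DIM))
W_pred = 0.1 * rng.normal(size=(LATENT_DIM + ACTION_DIM, LATENT_DIM))

z_t = encoder(obs, W_enc)
z_next_hat = predictor(z_t, np.zeros(ACTION_DIM), W_pred)
```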

Training Objective. The complete LeWM training objective combines a classical prediction loss \(\gL_{\rm pred}\) with a regularization term:

$$\gL_{\rm LeWM} \triangleq \gL_{\rm pred} + \lambda\,\mathrm{SIGReg}(\mZ)$$

The prediction loss \(\gL_{\rm pred}\) is a standard latent prediction loss. SIGReg is a regularizer enforcing a Gaussian distribution over the latent space; we refer the reader to LeJEPA for details.
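As a rough illustration of the two-term objective, the sketch below pairs an MSE prediction loss with a simplified moment-matching stand-in for SIGReg: random 1-D projections of the batch embeddings are pushed toward zero mean and unit variance, as an isotropic Gaussian target would require. The actual SIGReg statistic is defined in the LeJEPA paper; `lam` plays the role of the single tunable loss hyperparameter \(\lambda\).

```python
import numpy as np

def pred_loss(z_pred, z_next):
    """Standard latent prediction loss: MSE between predicted and true embeddings."""
    return np.mean((z_pred - z_next) ** 2)

def sigreg_standin(Z, num_proj=64, seed=0):
    """Simplified stand-in for SIGReg (not the paper's exact statistic):
    random 1-D projections of the batch Z (batch, dim) are pushed toward
    zero mean and unit variance, as an isotropic Gaussian would have."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(Z.shape[1], num_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    proj = Z @ dirs                                   # (batch, num_proj) 1-D marginals
    mean_term = np.mean(proj.mean(axis=0) ** 2)
    var_term = np.mean((proj.var(axis=0) - 1.0) ** 2)
    return mean_term + var_term

rng = np.random.default_rng(1)
Z_next = rng.normal(size=(256, 192))                  # embeddings of the next frames
Z_pred = Z_next + 0.1 * rng.normal(size=(256, 192))   # predictor outputs

lam = 1.0                                             # the single loss hyperparameter
loss = pred_loss(Z_pred, Z_next) + lam * sigreg_standin(Z_next)

# A collapsed latent space (all embeddings identical) is heavily penalized:
collapsed = sigreg_standin(np.zeros((256, 192)))
```

Note how the regularizer alone rules out collapse: a degenerate batch scores far worse than a Gaussian-distributed one, which is what lets the prediction loss stay a plain MSE.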

Planning with LeWM

LeWM plans purely from pixels, with no proprioceptive information used at any stage. At test time, LeWM encodes a start and goal image into latent space, then uses the Cross-Entropy Method to optimize an action sequence by rolling out candidates through the predictor and picking those whose final embedding lands closest to the goal. Because each frame is encoded as a single 192-dim token (roughly 200× fewer tokens than DINO-WM), planning completes in about 1 second versus 47 seconds for DINO-WM, a 48× speedup. We perform ablations on several design choices of LeWM and find that it reaches similar performance while being orders of magnitude more efficient than DINO-WM.
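The planning loop just described can be sketched as follows. The goal-distance objective on final embeddings follows the description above; the CEM hyperparameters (population size, elite count, iterations) and the toy 2-D dynamics are illustrative assumptions.

```python
import numpy as np

def cem_plan(z_start, z_goal, predictor, horizon=5, action_dim=2,
             pop=64, elites=8, iters=10, seed=0):
    """Cross-Entropy Method in latent space: sample action sequences, roll them
    out through the predictor, keep the elites whose final embedding lands
    closest to the goal, and refit the sampling distribution."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        actions = mu + sigma * rng.normal(size=(pop, horizon, action_dim))
        costs = np.empty(pop)
        for i in range(pop):
            z = z_start
            for t in range(horizon):
                z = predictor(z, actions[i, t])
            costs[i] = np.linalg.norm(z - z_goal)     # distance to goal embedding
        elite = actions[np.argsort(costs)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu                                         # the optimized action sequence

# Toy dynamics: the "latent" is a 2-D position and actions translate it.
toy_pred = lambda z, a: z + 0.1 * a
plan = cem_plan(np.zeros(2), np.array([1.0, 1.0]), toy_pred)
```

Because the rollouts happen entirely in the compact latent space, the inner loop never touches pixels, which is where the planning-time advantage over token-heavy world models comes from.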


Efficient Planning

Planning performance at a fixed compute budget (FLOPs). LeWM achieves competitive results with a fraction of the computation required by baselines.

Planning time comparison
Push-T fixed FLOPs
OGBench Cube fixed FLOPs

Planning Results

Planning performance across four environments: Two-Room (2D navigation), Reacher (2-joint arm control), Push-T (block manipulation), and OGBench-Cube (3D robotic pick-and-place). LeWM outperforms PLDM on all challenging tasks and surpasses DINO-WM on Push-T and Reacher, even without pre-trained features. On Push-T, LeWM beats DINO-WM even when DINO-WM uses additional proprioceptive inputs. DINO-WM retains an edge on the visually complex 3D OGBench-Cube task, likely due to richer visual priors from large-scale pretraining. LeWM underperforms on Two-Room; we suspect this is due to the intrinsic dimensionality of the task being too low, which may hinder the Gaussian regularizer from producing a well-structured latent space.

Two-Room

Two-Room planning Two-Room planning plot

Reacher

Reacher planning Reacher planning plot

Push-T

Push-T planning Push-T planning plot

OGBench Cube

OGBench Cube planning OGBench Cube planning plot

Additional qualitative rollouts for each environment are shown below, including both success and failure cases. Each clip shows two frames side by side: left is the planning rollout and right is the visual goal.

Two-Room

Success

Two-Room success

Success

Two-Room success

Failure

Two-Room failure
Reacher

Success

Reacher success

Success

Reacher success

Failure

Reacher failure
Push-T

Success

Push-T success

Success

Push-T success

Failure

Push-T failure
OGBench Cube

Success

Cube success

Failure

Cube failure

Failure

Cube failure

Evaluating Physical Understanding


Physical Latent Probing

We evaluate which physical quantities are recoverable from LeWM's latent representations by training lightweight supervised probes (linear and MLP) to predict them from the frozen latent embeddings.
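A minimal version of such a probe, on synthetic data: the 192-dim embeddings and the linear ground-truth target are stand-ins (with real LeWM features, Z would come from the frozen encoder and y from logged physical state), and the linear probe is fit in closed form by least squares. The MSE and Pearson r mirror the metrics reported in the table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: with real LeWM features, Z would be frozen encoder
# embeddings and y a logged physical quantity (e.g. block location).
Z = rng.normal(size=(1000, 192))
w_true = rng.normal(size=(192, 2))
y = Z @ w_true + 0.01 * rng.normal(size=(1000, 2))   # noisy (x, y) target

# Linear probe fit in closed form by least squares on the frozen embeddings.
W, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ W
mse = np.mean((y - y_hat) ** 2)
r = np.corrcoef(y.ravel(), y_hat.ravel())[0, 1]      # Pearson correlation
```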

Physical latent probing results on Push-T. LeWM consistently outperforms PLDM while remaining competitive with DINO-WM. The strong probing performance of DINO-WM on certain properties may stem from its foundation-model pretraining: the DINOv2 encoder is trained on two orders of magnitude more data (∼124M images) spanning a far more diverse distribution, which likely allows it to capture some physical properties in its embeddings by default.

| Model | Agent Location, Linear (MSE ↓ / r ↑) | Agent Location, MLP (MSE ↓ / r ↑) | Block Location, Linear (MSE ↓ / r ↑) | Block Location, MLP (MSE ↓ / r ↑) | Block Angle, Linear (MSE ↓ / r ↑) | Block Angle, MLP (MSE ↓ / r ↑) |
|---|---|---|---|---|---|---|
| DINO-WM | 1.888 / 0.977 | 0.003 / 0.999 | 0.006 / 0.997 | 0.002 / 0.999 | 0.050 / 0.979 | 0.009 / 0.995 |
| PLDM | 0.090 / 0.955 | 0.014 / 0.993 | 0.122 / 0.938 | 0.011 / 0.994 | 0.446 / 0.745 | 0.056 / 0.972 |
| LeWM | 0.052 / 0.974 | 0.004 / 0.998 | 0.029 / 0.986 | 0.001 / 0.999 | 0.187 / 0.902 | 0.021 / 0.990 |

Decoding Latent Space

To visualize the predictions made by LeWM, we train a lightweight decoder (used only for visualization, not during training) to reconstruct images from the CLS token embedding. For each environment, we report: (i) the original video, (ii) the video obtained by encoding and decoding each frame independently, and (iii) the video of latent predictions produced by the world model when conditioned on an action sequence. We can see that LeWM captures the most important parts of the scene and reproduces environment dynamics.
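The decoder's role can be illustrated with a toy closed-form version. The frame size, the random data, and the single linear layer are assumptions for brevity (the paper trains a small network on the CLS-token embeddings), but the key property is preserved: the decoder only reads frozen embeddings and sends no gradients back to the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 500 frames of 16x16 grayscale and their 192-dim embeddings.
Z = rng.normal(size=(500, 192))
frames = rng.normal(size=(500, 16 * 16))

# Lightweight decoder (here a single linear map, fit in closed form) that
# reconstructs pixels from frozen embeddings; used only for visualization,
# never during world-model training.
W_dec, *_ = np.linalg.lstsq(Z, frames, rcond=None)

def decode(z):
    return (z @ W_dec).reshape(16, 16)

# (ii) per-frame encode/decode applies decode to encoder outputs; (iii) the
# rollout videos instead decode the predictor's latent predictions.
recon = decode(Z[0])
```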

Two-Room decoding

Two-Room

Reacher decoding

Reacher

Push-T decoding

Push-T

OGBench Cube decoding

OGBench Cube


Visualizing Latent Space

A t-SNE visualization of the latent space in the Push-T environment suggests that the learned representation captures the spatial structure of the environment, preserving neighborhood relationships and relative positions.

t-SNE LeJEPA

LeWM


Violation-of-Expectation

To evaluate the physical understanding of our models, we follow the violation-of-expectation paradigm, where a world model should assign higher surprise to events that contradict learned physical regularities.
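One simple surprise proxy, assumed here for illustration (not necessarily the paper's exact score), is the world model's prediction error in latent space: encode the observed next frame and measure how far it lands from the predictor's output.

```python
import numpy as np

def surprise(z_pred, z_actual):
    """Illustrative surprise proxy: distance between the predicted next
    embedding and the embedding of the frame actually observed."""
    return float(np.linalg.norm(z_pred - z_actual))

# Toy illustration with 192-dim embeddings: a teleportation-style perturbation
# moves the observation far from where the dynamics predicted, so its surprise
# should exceed the unperturbed case.
z_pred = np.zeros(192)
z_unperturbed = 0.05 * np.ones(192)     # close to the model's prediction
z_teleported = 1.0 * np.ones(192)       # physically implausible jump
```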

Push-T

Unperturbed

Block color change

Teleportation

Surprise scores for Push-T

Surprise scores

Cube

Unperturbed

Cube color change

Teleportation

Surprise scores for Cube

Surprise scores

Citation (BibTeX)

@article{leworldmodel2026,
  author    = {Maes, Lucas and Le Lidec, Quentin and Scieur, Damien and LeCun, Yann and Balestriero, Randall},
  title     = {LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels},
  year      = {2026},
}