What We Learned Validating JEPA on a Laptop

We reproduced the core claims of two 2025-26 JEPA papers (CrossJEPA and LeJEPA) at toy scale on an M5 MacBook Air. The mechanisms hold. Here is what that means for enterprises deciding whether self-supervised representation learning is real or hype.

Joint-Embedding Predictive Architectures (JEPA) are Yann LeCun’s bet on what comes after generative pretraining. Instead of predicting pixels or tokens, a JEPA predicts in representation space: it learns to guess the embedding of one view of the world from another. The pitch is that this produces world models that capture structure, not surface detail, and that they do it without labels.

That is an architecture claim. Architecture claims are testable.

Over the last month we took two of the most consequential JEPA papers of the cycle and rebuilt their central experiments from scratch, at toy scale, in MLX, on a fanless M5 MacBook Air. Not to reproduce their benchmark numbers. To isolate the mechanisms the papers credit for their results and check whether each one actually does what is claimed.

Both held. This post is what we found, how we tested it, and why a CIO should care that we did.


Why validate at all

There is a standard failure mode in enterprise AI: a paper reports a state-of-the-art number, a vendor cites the paper, and a buyer assumes the number transfers to their problem. It usually does not. The benchmark number bundles together the idea, the scale, the data, and a hundred tuning decisions. When the idea ships into a different domain at a different scale, the number evaporates and nobody can say which part broke.

The defensible question is narrower. Strip the scale away. Does the mechanism the paper credits for its result actually fire? If it does, the idea is real and the only open question is whether it survives scale-up. If it does not, the benchmark number was carried by something else and the idea is a liability.

That is the question a toy validation answers, and it answers it for the cost of a laptop afternoon instead of a GPU cluster.


Paper 1: CrossJEPA — teaching a 3D model from 2D images

CrossJEPA (Perera et al., 2025) tackles a real industrial problem: you have lots of 2D images and a strong frozen image model, and you want a small 3D point-cloud model. The naive routes are expensive or fragile. Reconstruct the geometry pixel-by-pixel and you burn the small model’s capacity on low-level detail. Co-train a contrastive 2D-3D space and you risk collapse.

CrossJEPA’s move:

Train a small predictor to guess a frozen image model’s embedding of a specific rendered view of the object. Hand that predictor the things the 3D model cannot know — camera pose, color — as side input. Don’t mask anything.

The side input is the clever part. Pose and color variation get absorbed by the predictor instead of polluting the 3D encoder. The side channel acts as a gradient sink. The encoder is pushed toward the shared 2D-3D content: shape and semantics.
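The conditioning idea can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the paper's code: the function names, the layer sizes, the two-layer MLP, and the squared-error loss are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_predictor(enc_dim, side_dim, hidden, teacher_dim):
    """Random weights for a tiny two-layer MLP predictor (hypothetical sizes)."""
    return {
        "w1": rng.normal(0.0, 0.1, (enc_dim + side_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, teacher_dim)),
        "b2": np.zeros(teacher_dim),
    }

def predict(params, z_3d, side):
    """Guess the teacher embedding from the 3D embedding plus side input."""
    x = np.concatenate([z_3d, side], axis=-1)  # side = pose and color values
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"]

def latent_loss(pred, target):
    """Squared error in embedding space (assumed; the paper's loss may differ)."""
    return float(np.mean((pred - target) ** 2))

# Toy batch: 4 objects, 32-dim 3D embeddings, 7 side values, 384-dim targets.
z_3d = rng.normal(size=(4, 32))
side = rng.normal(size=(4, 7))      # e.g. 4 pose values + 3 color values
target = rng.normal(size=(4, 384))  # stand-in for cached teacher embeddings
params = init_predictor(32, 7, 64, 384)
loss = latent_loss(predict(params, z_3d, side), target)
```

In this sketch the side input is just concatenated to the encoder output, so removing it (with `side_dim=0`) gives the unconditioned baseline while everything else stays fixed.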

The paper credits three mechanisms. We tested all three.

The setup

| Paper | Our validation |
| --- | --- |
| ShapeNet + Objaverse CAD | 8 procedural 3D primitive classes |
| Frozen DINOv2 ViT-B/14 teacher | Frozen DINOv2-small, run once |
| 8.5M-param point transformer | ~1.1M-param tiny transformer |
| One-time embedding cache | teacher.py caches all view embeddings to .npz |
| RTX 4090, ~6 GPU-hours | M5 MacBook Air, ~8 min sweep |

One design choice made the ablations honest: the stored point cloud is given a controlled rotation while the render stays canonical. That makes camera pose genuinely target-only information that the point encoder cannot recover, which is exactly what makes the conditioning test meaningful. Color is only weakly class-correlated, so it is a true nuisance variable and not a label leak.

Result 1: conditioning purifies the signal

Feed the predictor more target-only side information. Watch the linear-probe accuracy of the 3D encoder.

| Predictor conditioning | Linear-probe accuracy |
| --- | --- |
| none | 76.9% ± 5.8 |
| + camera pose | 78.6% ± 3.1 |
| + pose + color | 81.9% ± 2.4 |

Monotonic. +5.0 points end to end. This is the load-bearing claim and it held. The standard deviation shrinks as conditioning is added, which matches the paper’s lower-variance-gradient argument. The signal gets better and more stable at the same time.
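For readers unfamiliar with the metric, a linear probe freezes the encoder and fits only a linear classifier on its features. A minimal sketch, assuming ridge least squares onto one-hot labels as a cheap stand-in for the logistic-regression probe that is more common in practice; the synthetic features and all names are hypothetical.

```python
import numpy as np

def _with_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((len(feats), 1))])

def fit_linear_probe(feats, labels, n_classes, ridge=1e-3):
    """Ridge least squares onto one-hot labels; the encoder stays frozen."""
    x = _with_bias(feats)
    y = np.eye(n_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def probe_accuracy(weights, feats, labels):
    """Fraction of samples whose highest-scoring class is the true one."""
    scores = _with_bias(feats) @ weights
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# Synthetic stand-in for frozen encoder features: 8 classes, 16 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 3.0, size=(8, 16))
train_y = rng.integers(0, 8, size=400)
test_y = rng.integers(0, 8, size=400)
train_x = centers[train_y] + rng.normal(size=(400, 16))
test_x = centers[test_y] + rng.normal(size=(400, 16))

weights = fit_linear_probe(train_x, train_y, n_classes=8)
accuracy = probe_accuracy(weights, test_x, test_y)
```

Because the probe has no capacity of its own beyond a linear map, its accuracy is a reading of how much class structure the frozen features already expose.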

Result 2: predict embeddings, not pixels

| Objective | Linear-probe accuracy |
| --- | --- |
| JEPA (predict teacher embedding) | 81.9% ± 2.8 |
| Reconstruction (predict pixels) | 73.3% ± 8.3 |

+8.6 points for latent prediction at equal budget. Reconstruction also carries far more variance. Forcing the encoder to reproduce pixel detail injects noise, exactly as the paper argues. This is the whole “predict in representation space” thesis, isolated.

Result 3: masking is not required

I-JEPA inherited masking from the masked-autoencoder lineage. CrossJEPA claims masking is not intrinsic to JEPA and can hurt in the cross-modal setting.

| Masking ratio | Linear-probe accuracy |
| --- | --- |
| 0.0 | 81.7% ± 2.0 |
| 0.3 | 81.7% ± 2.0 |
| 0.5 | 80.6% ± 6.2 |
| 0.75 | 76.9% ± 5.8 |

Flat to 0.3, then it degrades. The expensive masking machinery is not needed here. The full input can be fed during training.
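What a masking ratio means for a point cloud can be made concrete with a short sketch. It assumes the simplest possible scheme, dropping individual points uniformly at random; the paper may mask patches or tokens instead, and `mask_points` is a hypothetical helper.

```python
import numpy as np

def mask_points(points, ratio, rng):
    """Keep a random (1 - ratio) fraction of the points; ratio 0.0 keeps all."""
    n_keep = max(1, int(round(len(points) * (1.0 - ratio))))
    keep = np.sort(rng.choice(len(points), size=n_keep, replace=False))
    return points[keep]

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))  # toy point cloud, one xyz row per point
kept = {r: len(mask_points(cloud, r, rng)) for r in (0.0, 0.3, 0.5, 0.75)}
```

At ratio 0.75 the encoder sees only a quarter of the input, which is one way to read why accuracy falls off at the high end of the sweep.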

The economics underneath

The mechanism that makes this cheap is the frozen teacher plus a one-time embedding cache. The teacher never runs during training. In our harness it was a single ~2 min pass, after which the entire 27-run ablation sweep took ~8 min on a laptop. That is the difference between a GPU-server job and a job you run on the machine you already own.
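The caching pattern itself is small. A sketch assuming a stand-in teacher function and a hypothetical file name; this is not the contents of teacher.py, only the shape of the idea: run the frozen model once, write the embeddings to .npz, and read from disk on every later run.

```python
import os
import tempfile
import numpy as np

calls = {"teacher": 0}

def toy_teacher(views):
    """Hypothetical stand-in for a frozen image model: one vector per view."""
    calls["teacher"] += 1
    return views.reshape(len(views), -1)[:, :8].astype(np.float32)

def load_or_build_cache(path, views, teacher_fn):
    """Run the teacher only if no cache exists, then always read from disk."""
    if not os.path.exists(path):
        np.savez(path, embeddings=teacher_fn(views))
    with np.load(path) as data:
        return data["embeddings"]

views = np.random.default_rng(0).normal(size=(16, 4, 4))  # toy "renders"
cache_path = os.path.join(tempfile.mkdtemp(), "teacher_cache.npz")
first = load_or_build_cache(cache_path, views, toy_teacher)   # builds the cache
second = load_or_build_cache(cache_path, views, toy_teacher)  # reads the cache
```

After the first call the teacher counter stays at one no matter how many ablation runs follow, which is the whole cost argument in miniature.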


Paper 2: LeJEPA — when does this actually learn a world model?

The harder question is theoretical. JEPAs work empirically, but why, and under what conditions? “When Does LeJEPA Learn a World Model?” (Klindt, LeCun, Balestriero, 2026) gives a proof.

LeJEPA is a stripped-down recipe: pull positive pairs together (alignment) while keeping the embedding distribution an isotropic Gaussian (Sketched Isotropic Gaussian Regularization, SIGReg). No negatives. No stop-gradient teacher. No architecture tricks. The paper proves this recovers the world’s latent variables up to a global rotation — “linear identifiability” — and that the Gaussian is the unique latent distribution for which this holds.
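The two-term recipe can be sketched as follows. This is a simplified illustration, not the paper's SIGReg: it assumes the regularizer compares the empirical characteristic function of random 1D projections against that of a standard Gaussian on a small grid of frequencies, and the weight `lam`, the grid, and the number of directions are hypothetical.

```python
import numpy as np

def alignment_loss(z_a, z_b):
    """Pull the embeddings of two views of the same sample together."""
    return float(np.mean(np.sum((z_a - z_b) ** 2, axis=1)))

def random_directions(k, dim, rng):
    """k random unit vectors used to sketch the distribution in 1D."""
    d = rng.normal(size=(k, dim))
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def sigreg_cf(z, directions, t_grid):
    """Gap between the empirical characteristic function of each 1D
    projection and that of N(0, 1), averaged over directions and t."""
    proj = z @ directions.T                  # (n, k)
    tp = proj[:, :, None] * t_grid           # (n, k, T)
    ecf_real = np.cos(tp).mean(axis=0)       # (k, T)
    ecf_imag = np.sin(tp).mean(axis=0)
    target = np.exp(-0.5 * t_grid ** 2)      # characteristic function of N(0, 1)
    return float(np.mean((ecf_real - target) ** 2 + ecf_imag ** 2))

def lejepa_loss(z_a, z_b, directions, t_grid, lam=1.0):
    """Alignment plus regularizer; lam is a hypothetical weight."""
    z = np.concatenate([z_a, z_b], axis=0)
    return alignment_loss(z_a, z_b) + lam * sigreg_cf(z, directions, t_grid)

rng = np.random.default_rng(0)
dirs = random_directions(16, 2, rng)
t_grid = np.linspace(0.5, 3.0, 6)
gaussian = rng.normal(size=(4000, 2))
collapsed = np.zeros((4000, 2))
reg_gaussian = sigreg_cf(gaussian, dirs, t_grid)
reg_collapsed = sigreg_cf(collapsed, dirs, t_grid)
```

The regularizer is near zero for an isotropic Gaussian cloud and large for a collapsed one, so it can oppose the trivial solution that alignment alone would accept.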

We reproduced the empirical signatures of all four theorems on a 2D Gaussian world with a 4-layer MLP. Full run: ~6 min, three seeds.

Theorem 1: forward identifiability

Push Gaussian latents through an unknown nonlinear mixing, train LeJEPA, measure how well a linear map recovers the true latents.

| Mixing | R² (recovery) |
| --- | --- |
| parabolic shear | 0.998 ± 0.000 |
| sinusoidal shear | 0.998 ± 0.000 |
| RealNVP coupling | 0.990 ± 0.010 |
| spiral | 0.849 ± 0.039 |

LeJEPA inverts each unknown nonlinear mixing up to rotation. The spiral is the hard case by design: it is measure-preserving, so Gaussianity of the observations is already satisfied and only alignment can drive recovery. It recovers more slowly and needs a larger step budget, exactly as the paper’s grid search predicts. The fact that the easy cases hit 0.998 is the signal.
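The recovery score is an ordinary R² for the best affine map from embeddings back to the true latents. A minimal sketch, assuming an in-sample least-squares fit (a held-out split would be the stricter version); the function name and the toy data are hypothetical.

```python
import numpy as np

def linear_recovery_r2(embeddings, latents):
    """R^2 of the best affine map from embeddings to true latents."""
    x = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    coef, *_ = np.linalg.lstsq(x, latents, rcond=None)
    residual = latents - x @ coef
    total = latents - latents.mean(axis=0)
    return float(1.0 - np.sum(residual ** 2) / np.sum(total ** 2))

rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 2))  # true Gaussian latents
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

# An encoder that is right up to a rotation scores essentially 1.
r2_rotated = linear_recovery_r2(z @ rot.T, z)

# An encoder that leaves a parabolic shear uninverted scores much lower.
sheared = np.column_stack([z[:, 0], z[:, 1] + z[:, 0] ** 2])
r2_sheared = linear_recovery_r2(sheared, z)
```

The metric is blind to rotation by construction, which is why it matches the "up to a global rotation" wording of the theorem.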

Theorem 2: the Gaussian is special

Sweep the shape of the latent distribution. α=2 is Gaussian.

| Latent shape α | R² (recovery) |
| --- | --- |
| 0.5 (heavy-tailed) | 0.964 |
| 1.0 (Laplace) | 0.989 |
| 1.5 | 0.997 |
| 2.0 (Gaussian) | 0.998 |
| 3.0 | 0.991 |
| 5.0 (toward uniform) | 0.978 |

Recovery peaks exactly at the Gaussian and degrades on both sides. The peak is gentle (~3 points) because SIGReg constrains the embedding beyond second moments and so has a wider robustness plateau than plain whitening, which is itself one of the paper’s findings. Directionally exact, and forgiving of mild non-Gaussianity. That forgiveness is good news for practitioners: the theory’s sweet spot is a basin, not a knife-edge.
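One family that matches the labels in the sweep (Laplace at α=1, Gaussian at α=2, flatter toward uniform as α grows) is the generalized Gaussian, with density proportional to exp(-|x|^α). The sketch below assumes that family and rescales every draw to unit variance; `sample_latents` and `kurtosis` are hypothetical helpers.

```python
from math import gamma
import numpy as np

def sample_latents(alpha, n, dim, rng):
    """Unit-variance draws from a generalized Gaussian with shape alpha."""
    # If G ~ Gamma(1/alpha, 1), then sign * G**(1/alpha) has density
    # proportional to exp(-|x|**alpha).
    g = rng.gamma(shape=1.0 / alpha, scale=1.0, size=(n, dim))
    sign = rng.choice([-1.0, 1.0], size=(n, dim))
    x = sign * g ** (1.0 / alpha)
    std = np.sqrt(gamma(3.0 / alpha) / gamma(1.0 / alpha))
    return x / std

def kurtosis(x):
    """Fourth standardized moment; 3 for a Gaussian, higher for heavy tails."""
    x = x.ravel() - x.mean()
    return float(np.mean(x ** 4) / np.mean(x ** 2) ** 2)

rng = np.random.default_rng(0)
samples = {a: sample_latents(a, 20000, 2, rng) for a in (1.0, 2.0, 5.0)}
kurt = {a: kurtosis(s) for a, s in samples.items()}
```

Holding the variance fixed across α means the sweep changes only the shape of the tails, which is the variable the theorem is about.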

Theorem 3: the error bound holds

The paper predicts recovery error stays below a bound built from the alignment gap and the whitening deviation.

30 out of 30 runs satisfied the bound. Every point sat below the line. Loose where the alignment gap dominates, tight near the optimum, exactly as derived.

Theorem 4: the representation is plannable

This is the payoff. If the encoder is identified up to rotation, a straight line in learned latent space should decode to a near-straight path in the true latent space, because a rotation does not bend straight lines.

| Encoder | R² (recovery) | Decoded path-length ratio (ideal = 1) |
| --- | --- | --- |
| trained LeJEPA | 0.998 | 1.07 ± 0.12 |
| frozen untrained (control) | 0.49 | 12.0 ± 12.6 |

A planner using a rotation-invariant cost gets the true plan for free. The untrained control warps the same straight plan roughly 12×. Linear identifiability turns the representation into a usable state space. That is what “learns a world model” cashes out to.
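The path-length ratio is easy to state in code. A sketch assuming the ratio is the length of the decoded path divided by the straight-line distance between its endpoints; the two decoders below (a pure rotation and a sinusoidal warp) are toy stand-ins, not trained encoders.

```python
import numpy as np

def path_length_ratio(decoded_path):
    """Length of the decoded path divided by its end-to-end distance."""
    steps = np.diff(decoded_path, axis=0)
    length = np.sum(np.linalg.norm(steps, axis=1))
    chord = np.linalg.norm(decoded_path[-1] - decoded_path[0])
    return float(length / chord)

# A straight plan in the learned space, sampled at 200 waypoints.
start, goal = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
plan = np.linspace(start, goal, 200)

theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
ratio_rotation = path_length_ratio(plan @ rot.T)  # rotation only

# A nonlinear distortion bends the same plan into a longer path.
warped = np.column_stack([plan[:, 0], plan[:, 1] + np.sin(3.0 * plan[:, 0])])
ratio_warped = path_length_ratio(warped)
```

A rotation leaves the ratio at 1 to numerical precision, while any bending of the space pushes it above 1, so the number reads directly as how much a planner's straight line gets distorted.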

The mechanism we watched directly

Everything hinges on a collapse / anti-collapse balance. Too little Gaussianity pressure and the encoder maps everything to a point; we watched the embedding covariance fall to zero within a few steps when SIGReg’s weight was too low. Too much and it overwhelms the predictive signal. In between, identifiability. Three regimes, exactly the paper’s Fig. 6.
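Collapse of this kind is cheap to monitor during training. A minimal sketch, assuming the eigenvalues of the embedding covariance as the diagnostic; the threshold `tol` and both function names are hypothetical.

```python
import numpy as np

def covariance_spectrum(z):
    """Eigenvalues of the embedding covariance, largest first."""
    cov = np.cov(z, rowvar=False)
    return np.linalg.eigvalsh(cov)[::-1]

def is_collapsed(z, tol=1e-3):
    """True when even the largest eigenvalue is below tol (hypothetical cutoff)."""
    return bool(covariance_spectrum(z)[0] < tol)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(2000, 2))             # near-isotropic, unit scale
collapsing = 1e-3 * rng.normal(size=(2000, 2))   # everything near one point
```

For a healthy isotropic embedding every eigenvalue sits near 1; a spectrum sliding toward zero is the numerical signature of the first regime described above.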


What we are not claiming

This matters more than any table above, so we will be blunt about it.

These are toy-scale validations of mechanisms, not reproductions of benchmark numbers. 8 synthetic shape classes, ~1.1M params, a 2D world, a 133k-param MLP. We did not reproduce CrossJEPA’s 94.2% on ModelNet40 and we cannot at this scale. We did not reproduce LeJEPA’s 1024-dim or pixel-control experiments.

The defensible statement is precise: every architectural choice each paper credits for its results contributes in the predicted direction, and each is cheaply reproducible. The leap to production accuracy rests on the papers’ full-scale numbers, not ours. The honest next step for CrossJEPA is swapping synthetic shapes for real point clouds (ModelNet10) and seeing whether the mechanisms survive real geometry. That is the gap between “interesting” and “deployable,” and we name it on purpose.

We also had to make our own engineering calls that diverge from the reference code. LeJEPA’s SIGReg needed a moment-matching term added to a pure characteristic-function loss, because the pure version saturates at collapse with a vanishing gradient. Our loss scales differ from the paper’s, so our effective regularization weight is not comparable to theirs. The mechanism reproduces; the specific hyperparameter does not transfer. Saying so is the difference between a validation and a press release.
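A moment-matching term can take many forms. The sketch below shows one plausible shape, assuming it targets the mean and variance of each random 1D projection; the function name and the choice of moments are illustrative assumptions, not the exact term described above.

```python
import numpy as np

def moment_penalty(z, directions):
    """Penalize each 1D projection for mean != 0 or variance != 1.
    Matching only the first two moments is an assumption of this sketch."""
    proj = z @ directions.T
    mean_gap = proj.mean(axis=0) ** 2
    var_gap = (proj.var(axis=0) - 1.0) ** 2
    return float(np.mean(mean_gap + var_gap))

rng = np.random.default_rng(0)
directions = rng.normal(size=(16, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

penalty_gaussian = moment_penalty(rng.normal(size=(4000, 2)), directions)
penalty_collapsed = moment_penalty(np.zeros((4000, 2)), directions)
```

A term like this would be added to the characteristic-function loss with its own weight, which is one more reason effective regularization weights are not comparable across implementations.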


Why a CIO should care that we do this

Three reasons, in order of how much they should change a buying decision.

1. It separates real ideas from cited ideas. A vendor who has only read a paper can quote its headline number. A vendor who has rebuilt its experiment can tell you which of the paper’s claims is load-bearing, which is incidental, and which one quietly broke when they tried it. For CrossJEPA, the load-bearing claim is conditioning the predictor on target-only information, and it is the one with the thinner margin: +5.0 points in our runs, against +8.6 for latent prediction. That is the thing to watch in any deployment. You only know that by running it.

2. The cost structure is the actual product. Both papers’ real contribution to an enterprise is not accuracy, it is economics. CrossJEPA’s frozen-teacher-plus-cache turns 3D pretraining from a cluster job into a laptop job. LeJEPA removes negatives, teachers, and architecture-specific tricks, which removes the parts of self-supervised learning that are hardest to operate. When the mechanism that delivers the headline number is also the mechanism that collapses the cost, that is a result worth building on. We confirmed both cost mechanisms fire, not just the accuracy.

3. Honest scope is a procurement signal. The most important section above is “What we are not claiming.” A team that volunteers the limits of its own evidence is a team you can trust to tell you when a deployment is out of its competence. The opposite — a number with no stated scope — is the single most reliable predictor of an AI project that fails in its first year.

Where does JEPA actually pay off today? Where you have a strong frozen model, a hard cap on the size or budget of the model you need to ship, and no abundant labels. On-device 3D perception. Robotics grasping from a depth camera, where you already know the camera extrinsics so pose-conditioning is free. LiDAR-plus-camera pipelines in autonomous driving, where calibration is logged and the conditioning grounds itself. It does not pay off where you have abundant labeled data (supervise directly) or where you need a large model anyway (the efficiency argument evaporates). Knowing which side of that line your problem sits on is the whole game.


The takeaway

Self-supervised representation learning is moving from “trust the benchmark” to “here is the mechanism and here is the proof it fires.” That shift is good for buyers. It means the claims are checkable, the costs are predictable, and the limits are stateable.

We check them. On a laptop, in minutes, before we recommend anything. The number on the slide is where a conversation starts. The mechanism underneath it is where the engineering decision gets made.


Veso AI builds and validates representation-learning and agentic systems for enterprise. To talk through whether a self-supervised approach fits your data and your budget, get in touch.