Joint-Embedding Predictive Architecture (JEPA), Simply Explained
AI, But Simple Issue #107

In recent years, Large Language Models (LLMs) and generative AI have been the main focus of AI research.
Although these models generate fluent, high-quality text, Yann LeCun and others argue that they lack a "world model," a gap that matters in the pursuit of Artificial General Intelligence (AGI).
Generative architectures like LLMs learn correlations between observations, but they don’t explicitly learn the abstract relationships (or rules) of the world that create those observations.
Humans and animals, by contrast, rely heavily on abstract world models to acquire the enormous amount of background knowledge we call “common sense.”

The Joint-Embedding Predictive Architecture (JEPA) (LeCun, 2022) is a predictive architecture that draws on this biological inspiration to serve as a backbone for world modeling and higher-level reasoning.
What You’ll Learn
Discriminative, generative, and predictive architectures
Joint embedding architectures (JEAs), the structure behind JEPA
The component breakdown of how JEPA works
JEPA variants and applications
The current state of JEPA and world models
What’s Helpful to Know
Self-Supervised Learning (SSL)
SSL is a paradigm in which a system is trained to “fill in the blanks.” The model derives its own supervision (labels) from the data itself, learning relationships between observed and unobserved parts of the input.
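To make “fill in the blanks” concrete, here is a minimal sketch of how a training example can be built from unlabeled text by hiding some tokens and using them as the targets. The function name `make_masked_example`, the `[MASK]` placeholder, and the 30% mask ratio are illustrative assumptions, not part of any specific SSL method.

```python
import random

def make_masked_example(tokens, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; the hidden tokens become the labels.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))

    # Observed input: the original sequence with some positions blanked out.
    observed = [mask_token if i in masked_positions else tok
                for i, tok in enumerate(tokens)]
    # Targets: what the model should predict at the blanked positions.
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return observed, targets

tokens = "the cat sat on the mat".split()
observed, targets = make_masked_example(tokens)
print(observed)
print(targets)
```

No human-written labels are involved: the supervision signal (`targets`) comes entirely from the data.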
World Models
Models that learn how the world works, or how an environment changes over time. World models have a form of imagination: they can simulate future states (in latent space) for reasoning and planning.

Representation Collapse
A failure mode in which a model learns a trivial solution that maps all (or nearly all) inputs to the same representation, making the embeddings uninformative.
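One simple way to spot this failure is to check how much the embeddings vary across a batch. The sketch below is a rough diagnostic under stated assumptions: the helper names and the `1e-3` threshold are hypothetical, and the check only catches complete collapse, where every input lands on (almost) the same point.

```python
from statistics import pstdev

def mean_embedding_std(embeddings):
    """Average, over embedding dimensions, of the standard deviation across the batch."""
    dims = list(zip(*embeddings))  # transpose: one tuple per dimension
    return sum(pstdev(d) for d in dims) / len(dims)

def is_collapsed(embeddings, threshold=1e-3):
    """Flag a batch whose embeddings barely differ from one another.

    The threshold is an arbitrary illustrative value.
    """
    return mean_embedding_std(embeddings) < threshold

# Every input maps to the same vector: a collapsed encoder.
collapsed = [[0.5, -0.2, 0.1] for _ in range(4)]
# Inputs map to clearly different vectors.
healthy = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0], [0.1, 0.1, -0.9], [0.9, -0.5, 0.4]]

print(is_collapsed(collapsed), is_collapsed(healthy))
```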
Contrastive Learning
A self-supervised technique where a model learns to pull representations of related (positive) pairs closer together in embedding space, while pushing unrelated (negative) pairs apart.
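The pull/push idea can be sketched with an InfoNCE-style loss, one common contrastive objective (chosen here for illustration; the definition above does not prescribe a particular loss). The function names, the toy 2-D vectors, and the temperature of `0.1` are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax probability assigned to the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# Positive close to the anchor vs. positive far from the anchor.
loss_near = info_nce(anchor, [0.9, 0.1], negatives)
loss_far = info_nce(anchor, [-0.9, 0.1], negatives)
print(loss_near, loss_far)
```

The loss is small when the positive is closer to the anchor than the negatives are, and grows when the positive drifts away, which is the gradient signal that pulls related pairs together and pushes unrelated ones apart.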