World Models, Simply Explained

AI, But Simple Issue #98

Hello from the AI, But Simple team! If you enjoy our content and custom visuals, consider sharing this newsletter with others or upgrading so we can keep doing what we do.

Large Language Models (LLMs) have become a centerpiece in modern AI. Not only are they extremely useful, scalable, and applicable in many industries, but they have also become the most used form of AI for the general public.

While LLMs demonstrate amazing pattern recognition and next-token prediction, they lack a grounded understanding of the physical world, making them unstable when deployed in new environments (e.g., autonomous driving, robotics).

In the pursuit of Artificial General Intelligence (AGI), researchers have shifted towards a new type of model to simulate the physical laws and relationships of reality.

World models are a new type of large-scale model that can understand the real world. At their core, world models give an artificial intelligence a form of imagination.

  • They are foundation models, meaning that they fall into the category of extremely large models trained on massive datasets.

  • They can be adapted to a wide range of downstream tasks like text generation, image recognition, etc. Think LLMs like ChatGPT, DeepSeek, or Claude.

Instead of an AI just reacting blindly to what it sees right now, a world model allows the agent to internally simulate, or "dream" about, the future before it makes a decision.

For instance, if an autonomous car is deciding whether to brake or swerve, its world model predicts how the pedestrians, other cars, and physics will react to both choices, allowing it to pick the safest option.
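The "imagine before acting" loop above can be sketched in a few lines of code. This is a toy illustration, not a real driving model: the dynamics, the pedestrian position, and the risk numbers are all invented stand-ins for what a trained world model would learn from data.

```python
def imagine(state, action, horizon=10):
    """Hypothetical learned dynamics: roll the state forward under one
    action and return an estimated risk (lower is safer). Every rule and
    constant here is a hand-written stand-in for a learned model."""
    x, lane, speed = state                 # distance along road (m), lane index, speed (m/s)
    hazard_x, hazard_lane = 20.0, 0        # assumed pedestrian ahead in lane 0
    risk = 0.0
    for _ in range(horizon):
        if action == "brake":
            speed = max(0.0, speed - 2.0)  # decelerate each imagined step
        elif action == "swerve":
            lane = 1                       # change lanes
            risk += 0.05                   # small assumed lane-change hazard
        x += speed * 0.1                   # advance the simulation by 0.1 s
        if lane == hazard_lane and abs(x - hazard_x) < 2.0:
            risk += 1.0                    # imagined collision
    return risk

# The agent "dreams" through both choices and picks the safer one.
state = (0.0, 0, 15.0)
action = min(["brake", "swerve"], key=lambda a: imagine(state, a))
```

In this toy setup braking stops the car before it ever reaches the hazard, so it scores as the safer choice; change the assumed speeds or distances and the preferred action can flip, which is exactly the point of simulating both futures first.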

World models can take in text, image, video, and movement to generate videos and navigate realistic physical environments.

The Foundation Behind World Models

In mathematics, real-world environments are modeled as a Partially Observable Markov Decision Process (POMDP).

This is a technical way of saying that the model, or agent, never gets to see the true (perfect) state of the entire universe.

It only gets noisy, incomplete observations (such as the pixels from a camera or audio waves from a microphone), which we denote as x_t.

The agent must then make decisions based on this uncertain information by learning what we call the transition dynamics.

In other words, the goal of a world model is to learn how the environment changes from one state to the next, expressed by the transition probability:

P(s_{t+1} | s_t, a_t)

This is a crucial expression used in model-based RL, and essentially, it asks, "What is the probability of the next state s_{t+1} occurring, given the current state s_t and the action a_t the AI decides to take?"
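In code, learned transition dynamics boil down to a function you can sample from: given (s_t, a_t), draw a plausible s_{t+1}. The sketch below uses a made-up linear rule plus Gaussian noise as the "learned" model; a real world model would replace it with a trained neural network, and the matrices A and B here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(s_t, a_t, noise_std=0.05):
    """Sample s_{t+1} ~ P(s_{t+1} | s_t, a_t).
    The dynamics are an invented linear rule with Gaussian noise,
    standing in for a learned neural transition model."""
    A = np.array([[1.0, 0.1],
                  [0.0, 0.9]])   # assumed state-transition matrix
    B = np.array([0.0, 0.1])     # assumed effect of the action
    mean = A @ s_t + B * a_t
    return mean + rng.normal(0.0, noise_std, size=s_t.shape)

s_t = np.array([0.0, 1.0])   # e.g., (position, velocity)
a_t = 1.0                    # e.g., throttle
# Averaging many samples recovers the model's expected next state.
samples = np.stack([transition(s_t, a_t) for _ in range(1000)])
print(samples.mean(axis=0))  # close to A @ s_t + B * a_t = [0.1, 1.0]
```

Because the model is a distribution rather than a single guess, the agent can sample many futures from the same (state, action) pair and plan against the spread of outcomes, not just one prediction.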

The key idea: world models approximate a latent-space version of this equation.
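Structurally, "latent-space version" means the model never predicts raw pixels forward in time: an encoder first compresses the observation x_t into a compact latent state z_t, and the transition function operates on z_t instead. The sketch below uses random, untrained weight matrices purely to show the shapes and the flow of data; every dimension and matrix is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy shapes: a 64x64 grayscale frame flattened to 4096 pixels,
# compressed to a 32-dim latent. The weights are random stand-ins
# for what an encoder / dynamics network would learn in training.
OBS_DIM, LATENT_DIM, ACTION_DIM = 64 * 64, 32, 4
W_enc = rng.normal(0, 0.01, (LATENT_DIM, OBS_DIM))    # encoder weights
W_dyn = rng.normal(0, 0.1, (LATENT_DIM, LATENT_DIM))  # latent dynamics
W_act = rng.normal(0, 0.1, (LATENT_DIM, ACTION_DIM))  # action conditioning

def encode(x_t):
    """Compress a raw observation x_t into a compact latent state z_t."""
    return np.tanh(W_enc @ x_t)

def latent_step(z_t, a_t):
    """Predict z_{t+1} from (z_t, a_t): the latent-space analogue of
    P(s_{t+1} | s_t, a_t)."""
    return np.tanh(W_dyn @ z_t + W_act @ a_t)

x_t = rng.normal(size=OBS_DIM)   # one fake camera frame
a_t = np.zeros(ACTION_DIM)
z_t = encode(x_t)                # 4096 numbers -> 32 numbers
z_next = latent_step(z_t, a_t)   # imagine the next step without touching pixels
```

Note that after the one-time encoding, every imagined step works with 32 numbers instead of 4,096 pixels, which is what makes rolling out many candidate futures cheap.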

An important question to ask is, why do we use a latent space representation?
