The Dragon Hatchling (BDH), Simply Explained

AI, But Simple Issue #105

For nearly a decade, the Transformer architecture has been the undisputed frontrunner of modern deep learning.

However, as models scale into the trillions of parameters and context windows lengthen, the quadratic O(n²) cost of self-attention in the sequence length n becomes a harsh limitation.

As a model’s context window grows, the Key-Value (KV) cache grows with every token kept in context, and its memory requirements quickly outstrip what a single accelerator can hold. This makes an ideal "infinite context," where a model continuously learns like our brains do, practically impossible on standard hardware.
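To put rough numbers on this, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration, not one taken from the BDH paper, and the function names are made up for this example.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2):
    """Approximate KV cache size for one sequence (hypothetical model shape)."""
    # The leading 2 counts one key vector plus one value vector
    # per token, per head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


def attention_score_count(seq_len):
    """Number of query-key scores in full self-attention (per head, per layer)."""
    # Every token is compared with every token, hence the n * n growth.
    return seq_len * seq_len


if __name__ == "__main__":
    for n in (4_096, 131_072, 1_048_576):
        gib = kv_cache_bytes(n) / 2**30
        print(f"{n:>9} tokens: {gib:7.1f} GiB of KV cache, "
              f"{attention_score_count(n):.2e} attention scores")
```

Under these assumptions the cache costs 0.5 MiB per token, so it grows from 2 GiB at 4,096 tokens to 512 GiB at roughly one million tokens, while the number of attention scores quadruples each time the context doubles.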

Additionally, recent complexity theory proofs suggest that autoregressive attention is mathematically bound to hallucinate at predictable rates.

In response, recent research has seen a pivot toward "Post-Transformer" architectures.

  • These models prioritize linear-time inference, constant memory bounds, dynamic state evolution, and the ability to learn continuously without forgetting.

In the post-transformer field, we’ve seen the emergence of State-Space Models and linear transformer backbones such as Gated DeltaNet, among other architectures.

One specific post-transformer architecture stands out from the rest: the Dragon Hatchling (BDH) (Kosowski et al., 2025), which models its architecture after the human brain.

Created by a team of researchers at Pathway (including alumni of Google Brain, Inria, and Mila), BDH is promising for long-horizon reasoning and is a strong candidate for continual learning.

What You’ll Learn

  1. The biological inspiration behind BDH

  2. The 2 fundamental operations enabling BDH reasoning

  3. The 4-stage BDH inference process

  4. BDH-GPU, a GPU-friendly BDH equivalent

What’s Helpful to Know

  • Continual Learning

    • An area of deep learning research focused on models that can learn new information continuously without degrading or "forgetting" previous knowledge.

  • ALiBi & RoPE

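    • Popular positional encoding techniques used to model relationships that depend on token positions. RoPE (Rotary Position Embedding) rotates query and key vectors by position-dependent angles so that their dot product encodes relative distance, while ALiBi (Attention with Linear Biases) subtracts a penalty proportional to distance from the attention scores, so distant tokens receive less weight.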

  • Truncated Backpropagation Through Time (TBPTT)

    • A training method used for recurrent models with a running memory state. TBPTT splits long sequences into smaller chunks and backpropagates only within each chunk while carrying the state forward, which reduces peak GPU memory use during training.
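As a concrete illustration of the ALiBi and RoPE descriptions above, here is a minimal pure-Python sketch. It assumes the slope schedule from the ALiBi paper for a power-of-two head count and a single two-dimensional RoPE pair with one hypothetical frequency `theta`; the function names are invented for this example and do not come from any library or from the BDH codebase.

```python
import math


def alibi_slopes(n_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., one slope per head
    # (assumes n_heads is a power of two).
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for past tokens j <= i;
    # future tokens are masked out with -inf (causal attention).
    return [[-slope * (i - j) if j <= i else -math.inf
             for j in range(seq_len)]
            for i in range(seq_len)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rope_rotate(pair, pos, theta=0.1):
    # Rotate one 2-D slice of a query or key by the angle pos * theta.
    x, y = pair
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


if __name__ == "__main__":
    # ALiBi: with identical raw scores (all zero), the bias alone
    # makes attention weights decay with distance.
    print(softmax(alibi_bias(4, alibi_slopes(8)[0])[3]))

    # RoPE: the score depends only on the relative offset (2 in both cases).
    q, k = (1.0, 2.0), (0.5, -1.0)
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

The two RoPE scores agree up to floating-point error because rotating both vectors shifts only their absolute angles, not the angle between them. The ALiBi weights are largest for the current token and shrink for older ones.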
