The Dragon Hatchling (BDH), Simply Explained
AI, But Simple Issue #105

For nearly a decade, the Transformer architecture has been the undisputed frontrunner of modern deep learning.
However, as models scale into the trillions of parameters, the quadratic O(n²) scaling of the self-attention mechanism with sequence length n is a harsh limitation.
As a model’s context window grows, the Key-Value (KV) cache grows with every token kept in context and its memory requirements balloon, making an ideal "infinite context," where a model continuously learns like our brains do, practically impossible on standard hardware.
Additionally, recent complexity theory proofs suggest that autoregressive attention is mathematically bound to hallucinate at predictable rates.
In response, recent research has seen a pivot toward "Post-Transformer" architectures.
These models prioritize linear-time inference, constant memory bounds, dynamic state evolution, and the ability to learn continuously without forgetting.
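To put the scaling concern above in numbers, here is a minimal back-of-the-envelope sketch. The model configuration (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) is a hypothetical one chosen for illustration, not taken from any specific model, and the helper names are our own.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Entries in one head's full n x n attention score matrix."""
    return n_tokens * n_tokens


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for a single sequence."""
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical configuration, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n in (4_096, 32_768, 262_144):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>7} tokens: {attention_score_entries(n):>14,} score entries per head, "
          f"{gib:6.1f} GiB KV cache")
```

Under these assumed settings the cache costs 0.5 MiB per token: 2 GiB at 4,096 tokens and 128 GiB at 262,144 tokens. The cache itself grows linearly with context, while the attention score matrix grows quadratically.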

In the post-transformer field, we’ve seen the emergence of State-Space Models and linear transformer backbones such as the Gated DeltaNet, among other architectures.
One specific post-transformer architecture stands out from the rest: The Dragon Hatchling (BDH) (Kosowski et al., 2025), which models its architecture after the human brain.
Created by a team of researchers at Pathway (including alumni of Google Brain, Inria, and Mila), BDH is promising for long-horizon reasoning and is a strong candidate for continual learning.
What You’ll Learn
The biological inspiration behind BDH
The 2 fundamental operations enabling BDH reasoning
The 4-stage BDH inference process
BDH-GPU, a GPU-friendly BDH equivalent
What’s Helpful to Know
Continual Learning
A research area in deep learning focused on models that learn new information continuously without degrading or "forgetting" previous knowledge.
ALiBi & RoPE
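Popular positional encoding techniques used to model relationships that depend on token positions. RoPE (Rotary Position Embedding) rotates query and key vectors by position-dependent angles so that their dot product depends on relative distance, while ALiBi (Attention with Linear Biases) adds a penalty to attention scores that grows linearly with distance, so distant tokens are naturally weakened.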
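The two ideas can be sketched in a few lines of plain Python. This is a toy illustration, not a production implementation: the function names, the example vectors, and the ALiBi slope of 0.5 are assumptions made here (real ALiBi uses a fixed, different slope per attention head).

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive (x, y) pairs of vec by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alibi_bias(query_pos, key_pos, slope):
    """Linear penalty added to the raw attention score (causal: key_pos <= query_pos)."""
    return -slope * (query_pos - key_pos)


q = [1.0, 2.0, 3.0, 4.0]   # illustrative query vector
k = [0.5, -1.0, 2.0, 0.0]  # illustrative key vector

# RoPE: the score depends only on the offset (2 here), not on absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(math.isclose(score_a, score_b, abs_tol=1e-9))

# ALiBi: a key 3 tokens back is penalized more than one 1 token back.
print(alibi_bias(10, 7, slope=0.5), alibi_bias(10, 9, slope=0.5))
```

RoPE changes the vectors themselves before the dot product, whereas ALiBi leaves them untouched and only shifts the resulting score.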

Truncated Backpropagation Through Time (TBPTT)
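A training method used for recurrent models with a running memory state. TBPTT splits long sequences into smaller chunks and backpropagates only within each chunk, carrying the memory state forward across chunk boundaries without gradients. This bounds the GPU memory needed at any given time.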
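As a minimal sketch of the idea, the toy below trains a one-parameter linear recurrence h_t = w * h_(t-1) + x_t with hand-written gradients. The model, data, chunk length, and learning rate are all assumptions chosen for illustration; a real setup would rely on a framework's autograd and detach the carried state instead.

```python
def chunk_loss_and_grad(w, h0, x_chunk, y_chunk):
    """Squared-error loss and dLoss/dw for one chunk, treating h0 as a constant."""
    # Forward pass: store the states needed for the backward pass.
    states = [h0]
    for x in x_chunk:
        states.append(w * states[-1] + x)
    loss = sum((states[t] - y_chunk[t - 1]) ** 2 for t in range(1, len(states)))

    # Backward pass: walk back through the chunk and stop at its start.
    grad_w, dh = 0.0, 0.0
    for t in range(len(x_chunk), 0, -1):
        dh += 2.0 * (states[t] - y_chunk[t - 1])  # loss gradient at step t
        grad_w += dh * states[t - 1]              # h_t = w * h_{t-1} + x_t
        dh *= w                                   # pass gradient to h_{t-1}
    return loss, grad_w, states[-1]


def tbptt_train(xs, ys, w=0.0, chunk_len=4, lr=0.05, epochs=200):
    for _ in range(epochs):
        h = 0.0  # running memory state, reset at the start of each pass
        for start in range(0, len(xs), chunk_len):
            _, grad_w, h = chunk_loss_and_grad(
                w, h, xs[start:start + chunk_len], ys[start:start + chunk_len])
            w -= lr * grad_w  # update per chunk; h is carried as a plain number
    return w


# Toy data generated by the same recurrence with a "true" weight of 0.5.
xs = [1.0, 0.0, -1.0, 0.5, 0.0, 1.0, -0.5, 0.0]
ys, h_true = [], 0.0
for x in xs:
    h_true = 0.5 * h_true + x
    ys.append(h_true)

print(tbptt_train(xs, ys))
```

Only one chunk's worth of states is kept in memory at a time. The trade-off is that dependencies reaching back past a chunk boundary contribute no gradient.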