Reinforcement Learning for Large Language Models, Simply Explained
AI, But Simple Issue #97

Training a language model to predict the next token is only the beginning of creating a useful LLM system. Feed it enough text and minimize cross-entropy, and you get a system with remarkable breadth and depth of knowledge, capable of attempting just about any task.
What you do not automatically get is a system that is always accurate, helpful, or safe.
A model trained purely on next-token prediction will happily complete a harmful prompt, give confidently wrong answers, or produce responses that are technically correct but practically useless.
The gap between "can predict text" and "behaves well" turns out to be significant, and closing it requires a different kind of learning signal entirely.
Reinforcement learning provides that signal. Rather than supervising the model on what the correct next token is, RL methods focus on whether a complete response was useful, accurate, safe, or correct.

This distinction matters enormously. Judging whether an output is useful, accurate, or safe requires evaluating the whole response, not individual tokens, and that judgment usually cannot be extracted from existing text data.
It has to be measured, either by humans or by a learned reward model, and fed back to the model as a training signal.
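To make the idea of a whole-response signal concrete, here is a deliberately toy sketch: a hand-written heuristic standing in for a learned reward model. The function name and scoring rules are invented for illustration; real reward models are neural networks trained on human judgments, not keyword checks.

```python
def sequence_reward(response: str) -> float:
    """Toy stand-in for a learned reward model: it scores a *complete*
    response, not individual tokens. The rules below are illustrative
    heuristics, not how real reward models work."""
    score = 0.0
    if response.strip():  # reward a non-empty response
        score += 0.5
    if response.rstrip().endswith((".", "!", "?")):  # finishes a sentence
        score += 0.25
    if "not sure" in response or "likely" in response:  # hedged claims
        score += 0.25
    return score

print(sequence_reward("Paris is the capital of France."))  # 0.75
```

Note that nothing in this signal says which tokens were right or wrong; the whole response gets one scalar score, which is exactly what an RL algorithm then has to turn into per-token gradient updates.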
This issue traces the landscape of RL methods for LLMs, from the original RLHF pipeline to the newer GRPO and RLVR approaches that have pushed reasoning models like DeepSeek-R1 to frontier performance.
Why Standard Supervised Learning Is Not Enough
To understand why RL is needed, it helps to be precise about what Supervised Fine-Tuning (SFT) can and cannot do.
SFT trains a model to imitate a dataset of (input, output) pairs. It is excellent at teaching format, style, and domain knowledge.
If you have high-quality examples of helpful responses, SFT can teach the model to produce responses that look like them. The limitation is that SFT optimizes for token-level imitation, not for response quality.
The model learns to produce outputs that resemble the training examples, but it has no mechanism for preferring a response that is slightly different from the training data but actually “better.”
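To see why token-level imitation cannot prefer a different-but-better response, consider this crude word-matching proxy (the function and scoring rule are invented for illustration; real SFT loss is cross-entropy over logits, but the failure mode is the same):

```python
def token_match_loss(reference: str, candidate: str) -> float:
    """Crude proxy for token-level imitation: the fraction of positions
    where the candidate deviates from the reference (illustrative only)."""
    ref_words = reference.split()
    pairs = zip(ref_words, candidate.split())
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / max(len(ref_words), 1)

ref = "The capital of France is Paris"
# A correct paraphrase still incurs maximal loss because the tokens differ:
print(token_match_loss(ref, "Paris is the capital of France"))  # 1.0
```

The paraphrase is just as correct as the reference, yet every position mismatches, so an imitation objective treats it as maximally wrong. A quality-based signal would score both responses equally well.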

More fundamentally, human preferences are not fully captured by any existing text corpus.
What makes a response genuinely helpful, appropriately uncertain, or well-calibrated to a specific user's needs comes from humans interacting with the model in real time, or from a model that has learned to predict human preferences.
This is the gap that RL fills. The training signal in RL is not "this is the correct token sequence" but rather "this response is better than that one."
The model learns to produce responses that score highly on a measure of quality, even when that measure cannot be expressed as a fixed dataset.
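The "this response is better than that one" signal is commonly formalized with the Bradley-Terry model, where the probability that one response is preferred depends only on the difference of the two scalar reward scores. A minimal sketch (the reward values are hypothetical):

```python
import math

def preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry model: probability that response A is preferred
    over response B, given scalar reward scores (sigmoid of the gap)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice; minimizing it
    pushes the chosen response's score above the rejected one's."""
    return -math.log(preference_prob(reward_chosen, reward_rejected))

print(preference_prob(2.0, 0.0))  # ~0.88: A is clearly preferred
print(pairwise_loss(2.0, 0.0))    # ~0.127: small loss, scores agree with label
```

This pairwise formulation is what lets preferences that were never written down as a fixed dataset be distilled into a trainable scalar objective.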
Instruction Fine-Tuning (IFT): The Foundation
Before any RL is applied, most modern LLMs undergo instruction fine-tuning (IFT), also called supervised fine-tuning on instruction data.
This step bridges the gap between a raw pre-trained model and one that can engage with user requests coherently.
A raw pre-trained model is like a document completion engine. Ask it a question and it may respond by generating more questions, since that is a common pattern in text data.
IFT transforms this into a model that treats inputs as instructions and produces appropriate responses.
The training data takes the form of (instruction, response) pairs, and the loss is the standard autoregressive objective (cross-entropy loss) applied only to the response tokens:

$$\mathcal{L}_{\text{IFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\big(y_t \mid x, y_{<t}\big)$$

Here, p_θ is the probability assigned by the current model, x is the instruction, and y is the response. The instruction tokens are used as context but are not supervised; only the response tokens contribute to the loss.
This single detail is what makes IFT produce instruction-following behavior rather than document-continuation behavior.
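Assuming per-token log-probabilities have already been computed by the model, the masking detail can be sketched as follows (the values are hypothetical; real implementations apply the mask inside the cross-entropy over logits):

```python
def ift_loss(token_logprobs, loss_mask):
    """Cross-entropy averaged over response tokens only.

    token_logprobs: log p_theta(token | context) at each position of the
    concatenated (instruction, response) sequence (hypothetical values).
    loss_mask: 1.0 for response tokens, 0.0 for instruction tokens.
    """
    masked_nll = [-lp * m for lp, m in zip(token_logprobs, loss_mask)]
    # Average the negative log-likelihood over supervised positions only.
    return sum(masked_nll) / sum(loss_mask)

# 3 instruction tokens (context only) + 2 response tokens (supervised).
logprobs = [-0.1, -0.2, -0.3, -0.5, -0.7]
mask     = [0.0,  0.0,  0.0,  1.0,  1.0]
print(round(ift_loss(logprobs, mask), 6))  # 0.6: mean NLL of response tokens
```

Flipping the mask to all ones would recover plain language-model training on the full sequence, which is exactly the document-continuation behavior IFT is trying to move away from.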
IFT alone produces surprisingly capable models. The LLaMA instruction-tuned variants, InstructGPT's initial SFT stage, and FLAN all demonstrate that with high-quality instruction data, a model can generalize well to new tasks.
However, IFT has a ceiling: it can only be as good as its training data, and the training data can only capture preferences that humans thought to write down explicitly.
RLHF: The Framework That Changed Everything
Reinforcement Learning from Human Feedback (RLHF) introduced the three-stage pipeline that has become the standard approach for aligning large language models.
