Introduction to RLHF
Learn why RLHF emerged, its training loop, and how preferences become better policies.
- Estimated time: 20 minutes
- Difficulty: Beginner
- Prerequisites: None
RLHF Training Objective
The RLHF fine-tuning objective balances reward model scores with a KL term that keeps the policy close to the supervised reference model:

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

Notation. $\pi_\theta$ is the policy we are training, $r_\phi$ is a learned reward model, and $\pi_{\mathrm{ref}}$ is the supervised fine-tuned (SFT) model that anchors behaviour. The scalar $\beta$ controls how daring the policy can be before drifting too far from SFT.
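In practice, many implementations fold the KL term into the reward on a per-token basis rather than computing it in closed form. The sketch below (plain Python, with hypothetical inputs) shows one common shaping scheme, assuming you already have the reward model's scalar score and per-token log-probabilities from both the policy and the SFT reference:

```python
def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-penalised rewards for one sampled completion.

    rm_score:        scalar score from the reward model r_phi(x, y)
    policy_logprobs: per-token log pi_theta(y_t | x, y_<t)
    ref_logprobs:    per-token log pi_ref(y_t | x, y_<t)
    beta:            KL coefficient (the strength of the tether)
    """
    # Per-token KL estimate: log pi_theta - log pi_ref.
    # Each token is penalised for drifting from the SFT reference...
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # ...and the reward model's score is credited at the final token.
    rewards[-1] += rm_score
    return rewards

# Hypothetical numbers for a 4-token completion:
print(shaped_rewards(
    rm_score=1.7,
    policy_logprobs=[-0.9, -1.2, -0.4, -2.1],
    ref_logprobs=[-1.0, -1.1, -0.5, -1.8],
    beta=0.1,
))
```

Raising `beta` strengthens the tether, so the drift penalty dominates and the policy stays close to SFT; lowering it lets the reward model's preferences pull harder.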
Why Human Feedback Matters
Large language models excel at pattern matching but struggle to capture human preferences directly from text corpora. RLHF introduces a feedback loop: humans express which responses are better, a reward model imitates those judgements, and policy optimisation nudges the model toward behaviour people actually like. The KL penalty acts like a safety tether, ensuring the policy explores new outputs without forgetting the supervised baseline.
In practice this loop transforms vague alignment goals into a concrete optimisation surface. Instead of guessing what a "helpful" answer looks like, the model receives dense reward signals derived from real comparisons.
The InstructGPT pipeline that popularised RLHF (Chapter 1 of the RLHF book) follows four stages:
- Pretraining: learn broad language modelling from web-scale corpora.
- Supervised fine-tuning (SFT): train on curated demonstrations to establish a safe baseline.
- Reward modelling: gather human preference comparisons and fit a reward model.
- RLHF optimisation: update the policy against the reward model while constraining with a KL penalty.
Each subsequent module in this guide drills into one of these phases.
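Reward modelling gets its own module later, but as a preview, the sketch below shows the pairwise (Bradley-Terry style) loss reward models are commonly trained with: given a human-preferred and a rejected completion for the same prompt, the model is pushed to score the preferred one higher. The scalar scores here are hypothetical stand-ins for a real model's outputs.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for one preference pair:
print(pairwise_preference_loss(score_chosen=0.8, score_rejected=-0.3))  # small loss
print(pairwise_preference_loss(score_chosen=-0.3, score_rejected=0.8))  # large loss
```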
Analogy: Writing Coach Meets Arcade Bot
Picture two characters from the analogy toolbox working together:
- Writing student: drafts multiple completions and absorbs iterative editor feedback. The reward model is that editor, predicting which draft the human would favour.
- Arcade bot: treats each prompt like an arcade level. Higher reward model scores mean better "points"; the KL term is the bumper keeping the bot from reckless, off-distribution moves.
Together they illustrate how RLHF blends human judgement (editor) with reinforcement-style exploration (arcade bot) to polish model behaviour.
The RLHF Loop
- Collect prompts and generate candidate completions with the SFT model.
- Ask humans to rank the completions. Store preference pairs.
- Train a reward model $r_\phi$ to predict those preferences.
- Optimise the policy $\pi_\theta$ with PPO or another RL algorithm, using the reward model and KL penalty (see the sketch after this list).
- Periodically refresh data and repeat, keeping humans in the loop.
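To make the data flow concrete, here is a toy, non-learning sketch of one generate → score → update pass, with hypothetical stand-in functions in place of the real models and the PPO update (human ranking and reward-model training are assumed to have happened offline):

```python
import random

# --- Hypothetical stand-ins for the real components ----------------------
def generate_candidates(prompt, n=4):
    """Stand-in for sampling n completions from the current policy."""
    return [f"{prompt} [draft {i}]" for i in range(n)]

def reward_model(prompt, completion):
    """Stand-in for the trained reward model r_phi(x, y): a random scalar."""
    return random.uniform(-1.0, 1.0)

def kl_estimate(completion):
    """Stand-in for the policy-vs-SFT KL estimate for this completion."""
    return random.uniform(0.0, 2.0)

def policy_update(batch):
    """Stand-in for a PPO step; the real thing back-propagates through pi_theta."""
    for completion, shaped in batch:
        print(f"{shaped:+.2f}  {completion}")

# --- One generate -> score -> update pass --------------------------------
beta = 0.1
prompt = "Explain RLHF in one sentence."
batch = []
for completion in generate_candidates(prompt):
    shaped = reward_model(prompt, completion) - beta * kl_estimate(completion)
    batch.append((completion, shaped))
policy_update(batch)
```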
Want to see the full loop animated? Jump over to the Reward Modeling and Policy Gradients modules once they go live.
Policy reward progression
Adjust the KL penalty to see how exploration pressure changes expected reward across training steps. Use the parameters panel to compare against the supervised model without RLHF updates.
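If the demo is unavailable, the same intuition fits in a few lines: with a fixed (entirely hypothetical) reward gain and KL cost per training step, a larger coefficient $\beta$ shrinks the net shaped reward the policy is chasing, which is why stronger penalties flatten the reward curve.

```python
# Hypothetical per-step numbers: each unit of extra reward-model score the
# policy discovers also incurs some KL divergence from the SFT reference.
reward_gain_per_step = 0.5
kl_cost_per_step = 1.2

for beta in (0.0, 0.05, 0.2, 0.5):
    net = reward_gain_per_step - beta * kl_cost_per_step
    print(f"beta={beta:<4}  net shaped reward per step = {net:+.2f}")
```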
Key Takeaways
- RLHF adds a learned reward signal on top of supervised fine-tuning.
- Human preference data is distilled into a reward model $r_\phi$.
- The KL divergence keeps the policy grounded near the SFT model.
- Analogy lenses (writing student, editor, arcade bot) make the loop easier to reason about.
- The InstructGPT pipeline guides modern post-training systems: pretrain → SFT → reward modelling → RLHF.
Quick Self-Check
Work through the questions below and use the instant feedback to spot any gaps.
1. Why does the RLHF objective include a KL penalty against the supervised reference model?
2. What role does the reward model play compared to human annotators?
3. Which analogy best captures the iterative editing loop introduced in this module?
4. After training the reward model, what happens next in the RLHF loop described here?
5. What keeps the RLHF system aligned over time according to the introduction?