Introduction to RLHF
Learn why RLHF emerged, its training loop, and how preferences become better policies.
- Estimated time: 20 minutes
- Difficulty: Beginner
- Prerequisites: None
RLHF Training Objective
The RLHF fine-tuning objective balances reward model scores with a KL term that keeps the policy close to the supervised reference model:

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

Notation. $\pi_\theta$ is the policy we are training, $r_\phi$ is a learned reward model, and $\pi_{\mathrm{ref}}$ is the supervised fine-tuned (SFT) model that anchors behaviour. The scalar $\beta$ controls how daring the policy can be before drifting too far from SFT.
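In practice, many implementations fold the KL term into the reward on a per-token basis rather than computing it in closed form. The sketch below (plain Python, with hypothetical inputs) shows one common shaping scheme, assuming you already have the reward model's scalar score and per-token log-probabilities from both the policy and the SFT reference:

```python
def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """KL-penalised rewards for one sampled completion.

    rm_score:        scalar score from the reward model r_phi(x, y)
    policy_logprobs: per-token log pi_theta(y_t | x, y_<t)
    ref_logprobs:    per-token log pi_ref(y_t | x, y_<t)
    beta:            KL coefficient (the strength of the tether)
    """
    # Per-token KL estimate: log pi_theta - log pi_ref.
    # Each token is penalised for drifting from the SFT reference...
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # ...and the reward model's score is credited at the final token.
    rewards[-1] += rm_score
    return rewards

# Hypothetical numbers for a 4-token completion:
print(shaped_rewards(
    rm_score=1.7,
    policy_logprobs=[-0.9, -1.2, -0.4, -2.1],
    ref_logprobs=[-1.0, -1.1, -0.5, -1.8],
    beta=0.1,
))
```

Raising `beta` strengthens the tether, so the drift penalty dominates and the policy stays close to SFT; lowering it lets the reward model's preferences pull harder.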
Why Human Feedback Matters
Large language models excel at pattern matching but struggle to capture human preferences directly from text corpora. RLHF introduces a feedback loop: humans express which responses are better, a reward model imitates those judgements, and policy optimisation nudges the model toward behaviour people actually like. The KL penalty acts like a safety tether, ensuring the policy explores new outputs without forgetting the supervised baseline.
In practice this loop transforms vague alignment goals into a concrete optimisation surface. Instead of guessing what a "helpful" answer looks like, the model receives dense reward signals derived from real comparisons.
The InstructGPT pipeline that popularised RLHF (Chapter 1 of the RLHF book) follows four stages:
- Pretraining: learn broad language modelling from web-scale corpora.
- Supervised fine-tuning (SFT): train on curated demonstrations to establish a safe baseline.
- Reward modelling: gather human preference comparisons and fit a reward model.
- RLHF optimisation: update the policy against the reward model while constraining with a KL penalty.
Each subsequent module in this guide drills into one of these phases.
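Reward modelling gets its own module later, but as a preview, the sketch below shows the pairwise (Bradley-Terry style) loss reward models are commonly trained with: given a human-preferred and a rejected completion for the same prompt, the model is pushed to score the preferred one higher. The scalar scores here are hypothetical stand-ins for a real model's outputs.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for one preference pair:
print(pairwise_preference_loss(score_chosen=0.8, score_rejected=-0.3))  # small loss
print(pairwise_preference_loss(score_chosen=-0.3, score_rejected=0.8))  # large loss
```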
Analogy: Writing Coach Meets Arcade Bot
Picture two characters from the analogy toolbox working together:
- Writing student: drafts multiple completions and absorbs iterative editor feedback. The reward model is that editor, predicting which draft the human would favour.
- Arcade bot: treats each prompt like an arcade level. Higher reward model scores mean better "points"; the KL term is the bumper keeping the bot from reckless, off-distribution moves.
Together they illustrate how RLHF blends human judgement (editor) with reinforcement-style exploration (arcade bot) to polish model behaviour.
The RLHF Loop
- Collect prompts and generate candidate completions with the SFT model.
- Ask humans to rank the completions. Store preference pairs.
- Train a reward model $r_\phi$ to predict those preferences.
- Optimise the policy $\pi_\theta$ with PPO or another RL algorithm, using the reward model and KL penalty (see the sketch after this list).
- Periodically refresh data and repeat, keeping humans in the loop.
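To make the data flow concrete, here is a toy, non-learning sketch of one generate → score → update pass, with hypothetical stand-in functions in place of the real models and the PPO update (human ranking and reward-model training are assumed to have happened offline):

```python
import random

# --- Hypothetical stand-ins for the real components ----------------------
def generate_candidates(prompt, n=4):
    """Stand-in for sampling n completions from the current policy."""
    return [f"{prompt} [draft {i}]" for i in range(n)]

def reward_model(prompt, completion):
    """Stand-in for the trained reward model r_phi(x, y): a random scalar."""
    return random.uniform(-1.0, 1.0)

def kl_estimate(completion):
    """Stand-in for the policy-vs-SFT KL estimate for this completion."""
    return random.uniform(0.0, 2.0)

def policy_update(batch):
    """Stand-in for a PPO step; the real thing back-propagates through pi_theta."""
    for completion, shaped in batch:
        print(f"{shaped:+.2f}  {completion}")

# --- One generate -> score -> update pass --------------------------------
beta = 0.1
prompt = "Explain RLHF in one sentence."
batch = []
for completion in generate_candidates(prompt):
    shaped = reward_model(prompt, completion) - beta * kl_estimate(completion)
    batch.append((completion, shaped))
policy_update(batch)
```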
Want to see the full loop animated? Jump over to the Reward Modeling and Policy Gradients modules once they go live.
Policy reward progression
Adjust the KL penalty to see how exploration pressure changes expected reward across training steps. Use the parameters panel to compare against the supervised model without RLHF updates.
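If the demo is unavailable, the same intuition fits in a few lines: with a fixed (entirely hypothetical) reward gain and KL cost per training step, a larger coefficient $\beta$ shrinks the net shaped reward the policy is chasing, which is why stronger penalties flatten the reward curve.

```python
# Hypothetical per-step numbers: each unit of extra reward-model score the
# policy discovers also incurs some KL divergence from the SFT reference.
reward_gain_per_step = 0.5
kl_cost_per_step = 1.2

for beta in (0.0, 0.05, 0.2, 0.5):
    net = reward_gain_per_step - beta * kl_cost_per_step
    print(f"beta={beta:<4}  net shaped reward per step = {net:+.2f}")
```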
Key Takeaways
- RLHF adds a learned reward signal on top of supervised fine-tuning.
- Human preference data is distilled into a reward model $r_\phi$.
- The KL divergence keeps the policy grounded near the SFT model.
- Analogy lenses (writing student, editor, arcade bot) make the loop easier to reason about.
- The InstructGPT pipeline guides modern post-training systems: pretrain → SFT → reward modelling → RLHF.
Quick Self-Check
Work through the questions below and use the instant feedback to spot any gaps.
1. Why does the RLHF objective include a KL penalty against the supervised reference model?
2. What role does the reward model play compared to human annotators?
3. Which analogy best captures the iterative editing loop introduced in this module?
4. After training the reward model, what happens next in the RLHF loop described here?
5. What keeps the RLHF system aligned over time according to the introduction?