Regularization & KL Control
Study KL divergence, entropy bonuses, and auxiliary losses that keep RLHF stable.
- Estimated time: 30 minutes
- Difficulty: intermediate
- Prerequisites: 2 modules
KL Regularised Objective
Chapter 8 writes the RLHF reward as the learned reward minus regularisation terms. Using the learned reward r_θ(x, y) and KL penalty weight λ, the per-sample reward becomes:

r(x, y) = r_θ(x, y) − λ · D_KL(π_RL(y | x) ‖ π_ref(y | x))
Auxiliary losses (e.g., log-likelihood on pretraining data or margin terms) can be added with additional coefficients, as described in Equations (24–27).
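To make the shape of this objective concrete, here is a minimal sketch, assuming per-token log-probabilities are already available for one sampled completion from both the policy being trained and the frozen reference model; the function and coefficient names (`kl_regularised_reward`, `lam`) are placeholders rather than Chapter 8's notation:

```python
# Minimal sketch of the per-sample KL-regularised reward (hypothetical names).
# Inputs are per-token log-probabilities for one completion y given prompt x.

def kl_regularised_reward(reward_model_score: float,
                          policy_logprobs: list[float],
                          reference_logprobs: list[float],
                          lam: float = 0.05) -> float:
    """r(x, y) = r_theta(x, y) - lam * sum_t [log pi_RL(y_t) - log pi_ref(y_t)]."""
    # The summed log-probability ratio is a single-sample estimate of the KL term.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - lam * kl_estimate
```

Averaging the same log-ratio sum over a batch yields the KL statistic that the tuning checklist below tells you to track; auxiliary losses then enter the full training objective with their own coefficients.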
Controlling Over-Optimisation
RLHF optimises against proxy rewards. Without constraints, models “game” the reward model—producing verbose or nonsensical outputs. KL penalties keep updates close to the reference policy, while entropy bonuses and auxiliary NLL terms maintain diversity and factual accuracy.
Chapter 8 encourages monitoring KL, reward statistics, and qualitative samples together. Regularisation is not a static formula but a feedback loop: adjust λ, entropy bonuses, or auxiliary losses when you observe reward hacking or over-refusal.
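One way to make that feedback loop concrete is to log a handful of statistics at each evaluation step and react to them; the record and heuristics below are illustrative assumptions, not a recipe from the book:

```python
# Illustrative per-update monitoring record for the regularisation feedback loop.
from dataclasses import dataclass

@dataclass
class RegularisationStats:
    step: int
    mean_kl: float                 # average KL to the reference policy over the batch
    mean_reward: float             # reward-model score before the KL penalty
    mean_effective_reward: float   # reward after subtracting lam * KL
    sample_text: str               # one qualitative sample for manual inspection

def flag_issues(stats: RegularisationStats, kl_target: float = 6.0) -> list[str]:
    """Cheap heuristics that suggest adjusting lam, entropy bonus, or auxiliary losses."""
    issues = []
    if stats.mean_kl > 2 * kl_target:
        issues.append("KL far above target: raise lam or lower the learning rate")
    if stats.mean_reward > 0 and stats.mean_effective_reward < 0:
        issues.append("KL penalty dominates the reward: check for drift or an over-large lam")
    return issues
```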
Tuning checklist:
- Target KL. Track actual KL against a desired range and adapt λ accordingly (see the controller sketch after this list).
- Entropy bonus. Use small bonuses to sustain exploration against strong KL weights.
- Auxiliary NLL. Periodically reinforce pretraining accuracy to avoid preference displacement.
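For the first item, a common approach is a small proportional controller on λ, in the spirit of the adaptive KL coefficient used in early RLHF work; the constants here (`target_kl`, `gain`, the ±0.2 clip) are illustrative assumptions rather than values from Chapter 8:

```python
# Hedged sketch of an adaptive KL coefficient: nudge lam so observed KL tracks a target.
def update_kl_coefficient(lam: float,
                          observed_kl: float,
                          target_kl: float = 6.0,
                          gain: float = 0.1) -> float:
    # Proportional error, clipped so a single noisy batch cannot swing lam too far.
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return lam * (1.0 + gain * error)
```

With a controller like this, the target KL rather than a fixed λ becomes the hyperparameter you actually reason about: λ rises when the policy drifts too far and relaxes when it hugs the reference too tightly.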
Analogy: Safety Harness for RLHF
The base model is a climber scaling new routes. KL penalties are the safety harness keeping the climber connected to the wall; entropy bonuses and auxiliary losses are the belayer adjusting slack to balance safety and freedom.
Safety harness
Keeps the climber (policy) tethered to the wall (reference model) so new moves never drift too far during RLHF optimisation.
Coach with metronome
Sets the tempo for exploration versus discipline—mirroring KL weights, entropy bonuses, and auxiliary losses from Chapter 8.
Regularisation Lab
Tweak λ, target KL, and entropy bonuses to see how Chapter 8’s techniques shape reward curves and trade-offs.
KL penalty playground
Adjust λ to see how KL regularisation pulls the effective reward curve toward the reference model, as discussed in Chapter 8.
Parameters (example readout): effective reward ≈ 2.73 after the KL penalty, with average KL ≈ 0.45. Chapter 8 recommends tuning λ so models learn without drifting into degenerate behaviour.
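To connect the readout to the formula, a toy calculation with a hypothetical raw reward and λ (neither value comes from the book or the widget):

```python
# Toy numbers only: reproducing an "effective reward" readout from the formula.
raw_reward = 2.775   # hypothetical reward-model score before the penalty
avg_kl = 0.45        # average KL to the reference policy, as in the readout above
lam = 0.1            # hypothetical KL penalty weight

effective_reward = raw_reward - lam * avg_kl
print(round(effective_reward, 2))  # 2.73
```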
Regularisation trade-offs
Balance KL weight and entropy bonus to manage stability versus exploration, echoing Chapter 8’s guidance.
Parameters (example readout): Stability 66% · Exploration 67% · Over-optimisation risk 30%
Chapter 8 notes that KL penalties, entropy bonuses, and auxiliary NLL terms work in concert. Use this control to reason about the qualitative effects before diving into training logs.
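Before reading real training logs, it can also help to see how these terms typically combine into a single scalar loss. The sketch below shows one common arrangement with illustrative names and coefficients; some implementations instead fold the KL term into the reward, as in the earlier reward sketch:

```python
# Illustrative combination of the regularisation terms into one scalar loss (lower is better).
def combined_loss(policy_loss: float,      # e.g. a clipped policy-gradient surrogate
                  kl_to_reference: float,  # mean KL(pi_RL || pi_ref) over the batch
                  entropy: float,          # mean policy entropy over the batch
                  pretrain_nll: float,     # NLL on held-out pretraining/instruction data
                  lam: float = 0.05,
                  beta_entropy: float = 0.01,
                  gamma_pretrain: float = 0.1) -> float:
    # Penalise drift from the reference, reward entropy, and anchor to pretraining data.
    return (policy_loss
            + lam * kl_to_reference
            - beta_entropy * entropy
            + gamma_pretrain * pretrain_nll)
```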
Operational Notes
- KL penalties keep policies close to a trusted reference; adapt λ to hit a target KL range.
- Entropy bonuses and auxiliary NLL terms complement KL by preserving exploration and factual grounding.
- Monitor reward hacking, mode collapse, and preference displacement—regularisation is a response to observed failure modes.
- Reference models can be instruction-tuned checkpoints or previous RLHF snapshots; pick one with reliable behaviour.
- Regularisation choices should be documented alongside hyperparameters for reproducibility.
Regularisation Check
Confirm your understanding of KL control, auxiliary losses, and Chapter 8’s guidance.
1. How does Chapter 8 define the KL regularised reward used in RLHF?
2. What intuition does the book provide for the KL penalty?
3. Why do some teams add an auxiliary NLL term according to Chapter 8?
4. What role do entropy bonuses play in regularisation?
5. Which failure mode motivates regularisation in Chapter 8?