Advanced Concepts

Regularization & KL Control

Study KL divergence, entropy bonuses, and auxiliary losses that keep RLHF stable.

Estimated time
30 minutes
Difficulty
intermediate
Prerequisites
2 module(s)
Equation

KL Regularised Objective

Chapter 8 writes the RLHF reward as the learned reward minus regularisation terms. Using the learned reward r_θ and KL penalty weight λ, the per-sample reward becomes:

r = r_\theta - \lambda\, D_{\mathrm{KL}}\left(\pi_{RL}(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x)\right)

Auxiliary losses (e.g., log-likelihood on pretraining data or margin terms) can be added with additional coefficients, as described in Equations (24–27).
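
A minimal sketch of the per-sample computation, assuming per-token log-probabilities from the policy and the frozen reference model are already available (function and argument names are illustrative, not from the book):

```python
import torch

def kl_regularised_reward(
    reward: torch.Tensor,           # r_theta for each sampled response, shape [batch]
    policy_logprobs: torch.Tensor,  # log pi_RL(y_t | x, y_<t), shape [batch, seq]
    ref_logprobs: torch.Tensor,     # log pi_ref(y_t | x, y_<t), shape [batch, seq]
    response_mask: torch.Tensor,    # 1.0 for generated tokens, 0.0 for prompt/padding
    kl_coef: float = 0.1,           # lambda, the KL penalty weight
) -> torch.Tensor:
    # Monte Carlo estimate of KL(pi_RL || pi_ref) for each sampled sequence:
    # sum of (log pi_RL - log pi_ref) over the generated tokens.
    kl_per_token = (policy_logprobs - ref_logprobs) * response_mask
    kl_estimate = kl_per_token.sum(dim=-1)   # shape [batch]
    # r = r_theta - lambda * KL
    return reward - kl_coef * kl_estimate
```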

Intuition

Controlling Over-Optimisation

RLHF optimises against proxy rewards. Without constraints, models “game” the reward model—producing verbose or nonsensical outputs. KL penalties keep updates close to the reference policy, while entropy bonuses and auxiliary NLL terms maintain diversity and factual accuracy.

Chapter 8 encourages monitoring KL, reward statistics, and qualitative samples together. Regularisation is not a static formula but a feedback loop: adjust λ, entropy bonuses, or auxiliary losses when you observe reward hacking or over-refusal.

Tuning checklist:

  1. Target KL. Track actual KL versus a desired range and adapt λ accordingly (a minimal controller sketch follows this list).
  2. Entropy bonus. Use small bonuses to sustain exploration against strong KL weights.
  3. Auxiliary NLL. Periodically reinforce pretraining accuracy to avoid preference displacement.
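
One way to implement the first item is a simple proportional controller, similar in spirit to the adaptive KL penalty used in early RLHF work. The class below is a sketch with illustrative defaults, not Chapter 8's exact recipe:

```python
class AdaptiveKLController:
    """Nudges the KL penalty weight lambda toward a target KL.

    If the observed KL is above target, lambda grows; if below, it shrinks.
    Clipping keeps a single noisy batch from swinging lambda wildly.
    """

    def __init__(self, init_kl_coef: float = 0.1,
                 target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coef = init_kl_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2].
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.kl_coef *= 1.0 + error * n_samples / self.horizon
        return self.kl_coef
```

After each batch, pass the measured KL and the number of samples to update() and use the returned coefficient as λ in the reward above.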
Analogy

Analogy: Safety Harness for RLHF

The base model is a climber scaling new routes. KL penalties are the safety harness keeping the climber connected to the wall; entropy bonuses and auxiliary losses are the belayer adjusting slack to balance safety and freedom.

Safety harness

Keeps the climber (policy) tethered to the wall (reference model) so new moves never drift too far during RLHF optimisation.

Coach with metronome

Sets the tempo for exploration versus discipline—mirroring KL weights, entropy bonuses, and auxiliary losses from Chapter 8.

Visualization

Regularisation Lab

Tweak λ, target KL, and entropy bonuses to see how Chapter 8’s techniques shape reward curves and trade-offs.

KL penalty playground

Adjust λ to see how KL regularisation pulls the effective reward curve toward the reference model, as discussed in Chapter 8.

[Interactive visualization]
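
If the widget is unavailable, a few lines of Python reproduce the basic relationship it illustrates: for a fixed proxy reward and a fixed measured KL, a larger λ pulls the effective reward down. The numbers below are made up for illustration:

```python
proxy_reward = 2.0   # r_theta for some sampled response (illustrative)
observed_kl = 8.0    # nats of divergence from the reference model (illustrative)

for lam in (0.0, 0.05, 0.1, 0.2, 0.5):
    effective = proxy_reward - lam * observed_kl
    print(f"lambda={lam:<4}  effective reward = {effective:+.2f}")
```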

Regularisation trade-offs

Balance KL weight and entropy bonus to manage stability versus exploration, echoing Chapter 8’s guidance.

[Interactive visualization]
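
A sketch of how the two knobs in this widget can enter a single training loss. Some implementations fold the KL penalty into the reward, as in the KL regularised objective above; this version adds it to the loss instead. `pg_loss` stands in for whatever policy-gradient loss you use, and the coefficients are illustrative:

```python
import torch
import torch.nn.functional as F

def regularised_policy_loss(
    pg_loss: torch.Tensor,      # scalar policy-gradient loss (e.g. a PPO surrogate)
    logits: torch.Tensor,       # [batch, seq, vocab] logits for generated tokens
    kl_estimate: torch.Tensor,  # [batch] per-sequence KL vs. the reference model
    kl_coef: float = 0.1,       # stability: pull toward the reference policy
    ent_coef: float = 0.01,     # exploration: reward higher token entropy
) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # Minimise the policy loss, penalise KL, reward entropy.
    return pg_loss + kl_coef * kl_estimate.mean() - ent_coef * entropy
```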
Takeaways

Operational Notes

  • KL penalties keep policies close to a trusted reference; adapt λ to hit a target KL range.
  • Entropy bonuses and auxiliary NLL terms complement KL by preserving exploration and factual grounding.
  • Monitor reward hacking, mode collapse, and preference displacement—regularisation is a response to observed failure modes.
  • Reference models can be instruction-tuned checkpoints or previous RLHF snapshots; pick one with reliable behaviour.
  • Regularisation choices should be documented alongside hyperparameters for reproducibility.
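
To make the last two bullets concrete, here is a sketch (with illustrative names and values) of mixing an auxiliary NLL term on pretraining batches into the RL loss and recording the regularisation knobs alongside the other hyperparameters:

```python
import torch
import torch.nn.functional as F

def total_loss(rl_loss: torch.Tensor,
               pretrain_logits: torch.Tensor,   # [batch, seq, vocab]
               pretrain_labels: torch.Tensor,   # [batch, seq], -100 = ignore
               nll_coef: float = 0.05) -> torch.Tensor:
    # Auxiliary NLL on pretraining data keeps the policy anchored to its
    # original language-modelling competence (guards against preference displacement).
    aux_nll = F.cross_entropy(
        pretrain_logits.view(-1, pretrain_logits.size(-1)),
        pretrain_labels.view(-1),
        ignore_index=-100,
    )
    return rl_loss + nll_coef * aux_nll

# Document the regularisation choices with the rest of the run's hyperparameters.
regularisation_config = {
    "kl_coef_init": 0.1,
    "target_kl": 6.0,
    "entropy_coef": 0.01,
    "aux_nll_coef": 0.05,
    "reference_model": "instruction-tuned checkpoint",  # or a previous RLHF snapshot
}
```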
Self-check

Regularisation Check

Confirm your understanding of KL control, auxiliary losses, and Chapter 8’s guidance.


  1. How does Chapter 8 define the KL regularised reward used in RLHF?
  2. What intuition does the book provide for the KL penalty?
  3. Why do some teams add an auxiliary NLL term according to Chapter 8?
  4. What role do entropy bonuses play in regularisation?
  5. Which failure mode motivates regularisation in Chapter 8?