Regularization & KL Control
Study KL divergence, entropy bonuses, and auxiliary losses that keep RLHF stable.
- Estimated time: 30 minutes
- Difficulty: intermediate
- Prerequisites: 2 modules
KL Regularised Objective
Chapter 8 writes the RLHF reward as the learned reward minus regularisation terms. Using the learned reward r_θ(x, y) and KL penalty weight λ, the per-sample reward becomes:

r(x, y) = r_θ(x, y) − λ · D_KL(π_RL(y | x) ‖ π_ref(y | x))
Auxiliary losses (e.g., log-likelihood on pretraining data or margin terms) can be added with additional coefficients, as described in Equations (24–27).
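To make the shape of this objective concrete, here is a minimal sketch, assuming per-token log-probabilities are already available for one sampled completion from both the policy being trained and the frozen reference model; the function and coefficient names (`kl_regularised_reward`, `lam`) are placeholders rather than Chapter 8's notation:

```python
# Minimal sketch of the per-sample KL-regularised reward (hypothetical names).
# Inputs are per-token log-probabilities for one completion y given prompt x.

def kl_regularised_reward(reward_model_score: float,
                          policy_logprobs: list[float],
                          reference_logprobs: list[float],
                          lam: float = 0.05) -> float:
    """r(x, y) = r_theta(x, y) - lam * sum_t [log pi_RL(y_t) - log pi_ref(y_t)]."""
    # The summed log-probability ratio is a single-sample estimate of the KL term.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - lam * kl_estimate
```

Averaging the same log-ratio sum over a batch yields the KL statistic that the tuning checklist below tells you to track; auxiliary losses then enter the full training objective with their own coefficients.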
Controlling Over-Optimisation
RLHF optimises against proxy rewards. Without constraints, models “game” the reward model—producing verbose or nonsensical outputs. KL penalties keep updates close to the reference policy, while entropy bonuses and auxiliary NLL terms maintain diversity and factual accuracy.
Chapter 8 encourages monitoring KL, reward statistics, and qualitative samples together. Regularisation is not a static formula but a feedback loop: adjust λ, entropy bonuses, or auxiliary losses when you observe reward hacking or over-refusal.
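One way to make that feedback loop concrete is to log a handful of statistics at each evaluation step and react to them; the record and heuristics below are illustrative assumptions, not a recipe from the book:

```python
# Illustrative per-update monitoring record for the regularisation feedback loop.
from dataclasses import dataclass

@dataclass
class RegularisationStats:
    step: int
    mean_kl: float                 # average KL to the reference policy over the batch
    mean_reward: float             # reward-model score before the KL penalty
    mean_effective_reward: float   # reward after subtracting lam * KL
    sample_text: str               # one qualitative sample for manual inspection

def flag_issues(stats: RegularisationStats, kl_target: float = 6.0) -> list[str]:
    """Cheap heuristics that suggest adjusting lam, entropy bonus, or auxiliary losses."""
    issues = []
    if stats.mean_kl > 2 * kl_target:
        issues.append("KL far above target: raise lam or lower the learning rate")
    if stats.mean_reward > 0 and stats.mean_effective_reward < 0:
        issues.append("KL penalty dominates the reward: check for drift or an over-large lam")
    return issues
```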
Tuning checklist:
- Target KL. Track actual KL against a desired range and adapt λ accordingly (see the controller sketch after this list).
- Entropy bonus. Use small bonuses to sustain exploration against strong KL weights.
- Auxiliary NLL. Periodically reinforce pretraining accuracy to avoid preference displacement.
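For the first item, a common approach is a small proportional controller on λ, in the spirit of the adaptive KL coefficient used in early RLHF work; the constants here (`target_kl`, `gain`, the ±0.2 clip) are illustrative assumptions rather than values from Chapter 8:

```python
# Hedged sketch of an adaptive KL coefficient: nudge lam so observed KL tracks a target.
def update_kl_coefficient(lam: float,
                          observed_kl: float,
                          target_kl: float = 6.0,
                          gain: float = 0.1) -> float:
    # Proportional error, clipped so a single noisy batch cannot swing lam too far.
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return lam * (1.0 + gain * error)
```

With a controller like this, the target KL rather than a fixed λ becomes the hyperparameter you actually reason about: λ rises when the policy drifts too far and relaxes when it hugs the reference too tightly.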
Analogy: Safety Harness for RLHF
The base model is a climber scaling new routes. KL penalties are the safety harness keeping the climber connected to the wall; entropy bonuses and auxiliary losses are the belayer adjusting slack to balance safety and freedom.
Safety harness
Keeps the climber (policy) tethered to the wall (reference model) so new moves never drift too far during RLHF optimisation.
Coach with metronome
Sets the tempo for exploration versus discipline—mirroring KL weights, entropy bonuses, and auxiliary losses from Chapter 8.
Regularisation Lab
Tweak λ, target KL, and entropy bonuses to see how Chapter 8’s techniques shape reward curves and trade-offs.
KL penalty playground
Adjust λ to see how KL regularisation pulls the effective reward curve toward the reference model, as discussed in Chapter 8.
Parameters (example readout): effective reward ≈ 2.73 after the KL penalty, with average KL ≈ 0.45. Chapter 8 recommends tuning λ so models learn without drifting into degenerate behaviour.
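To connect the readout to the formula, a toy calculation with a hypothetical raw reward and λ (neither value comes from the book or the widget):

```python
# Toy numbers only: reproducing an "effective reward" readout from the formula.
raw_reward = 2.775   # hypothetical reward-model score before the penalty
avg_kl = 0.45        # average KL to the reference policy, as in the readout above
lam = 0.1            # hypothetical KL penalty weight

effective_reward = raw_reward - lam * avg_kl
print(round(effective_reward, 2))  # 2.73
```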
Regularisation trade-offs
Balance KL weight and entropy bonus to manage stability versus exploration, echoing Chapter 8’s guidance.
Parameters (example readout): Stability 66% · Exploration 67% · Over-optimisation risk 30%
Chapter 8 notes that KL penalties, entropy bonuses, and auxiliary NLL terms work in concert. Use this control to reason about the qualitative effects before diving into training logs.
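Before reading real training logs, it can also help to see how these terms typically combine into a single scalar loss. The sketch below shows one common arrangement with illustrative names and coefficients; some implementations instead fold the KL term into the reward, as in the earlier reward sketch:

```python
# Illustrative combination of the regularisation terms into one scalar loss (lower is better).
def combined_loss(policy_loss: float,      # e.g. a clipped policy-gradient surrogate
                  kl_to_reference: float,  # mean KL(pi_RL || pi_ref) over the batch
                  entropy: float,          # mean policy entropy over the batch
                  pretrain_nll: float,     # NLL on held-out pretraining/instruction data
                  lam: float = 0.05,
                  beta_entropy: float = 0.01,
                  gamma_pretrain: float = 0.1) -> float:
    # Penalise drift from the reference, reward entropy, and anchor to pretraining data.
    return (policy_loss
            + lam * kl_to_reference
            - beta_entropy * entropy
            + gamma_pretrain * pretrain_nll)
```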
Operational Notes
- KL penalties keep policies close to a trusted reference; adapt λ to hit a target KL range.
- Entropy bonuses and auxiliary NLL terms complement KL by preserving exploration and factual grounding.
- Monitor reward hacking, mode collapse, and preference displacement—regularisation is a response to observed failure modes.
- Reference models can be instruction-tuned checkpoints or previous RLHF snapshots; pick one with reliable behaviour.
- Regularisation choices should be documented alongside hyperparameters for reproducibility.
Regularisation Check
Confirm your understanding of KL control, auxiliary losses, and Chapter 8’s guidance.
1. How does Chapter 8 define the KL regularised reward used in RLHF?
2. What intuition does the book provide for the KL penalty?
3. Why do some teams add an auxiliary NLL term according to Chapter 8?
4. What role do entropy bonuses play in regularisation?
5. Which failure mode motivates regularisation in Chapter 8?