Advanced Concepts

Direct Preference Optimization (DPO)

Optimise policies directly against preference data without training a reward model.

Estimated time
30 minutes
Difficulty
intermediate
Prerequisites
2 module(s)
Equation

DPO Objective

Direct Preference Optimization (DPO) solves the RLHF objective without a reward model. Chapter 12 shows that the optimal policy of the KL-constrained objective is an exponential tilt of the reference policy by the reward; inverting that relationship expresses the reward in terms of the policy itself, so the policy can be fit directly on the human preference dataset.

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(\beta \big(\log \pi_\theta(y_c|x) - \log \pi_\theta(y_r|x) - \log \pi_{\text{ref}}(y_c|x) + \log \pi_{\text{ref}}(y_r|x)\big)\big)\Big]

DPO’s gradient (Equation 66 in the book) scales the gradient of the chosen-minus-rejected log-probability gap by σ(−βΔ), where Δ is the logit gap relative to the reference model: the weight grows while the policy still prefers the rejected answer and shrinks once it agrees with the annotator. No separate reward model is required.
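
Written out in the notation of the objective above, the gradient takes the standard DPO form (a reconstruction, not a verbatim copy of the book’s Equation 66):

\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\sigma(-\beta \Delta)\,\big(\nabla_\theta \log \pi_\theta(y_c|x) - \nabla_\theta \log \pi_\theta(y_r|x)\big)\Big], \qquad \Delta = \log \frac{\pi_\theta(y_c|x)}{\pi_{\text{ref}}(y_c|x)} - \log \frac{\pi_\theta(y_r|x)}{\pi_{\text{ref}}(y_r|x)}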

Intuition

Offline Alignment Intuition

Instead of training a reward model and running PPO, DPO asks: “Can we tilt the policy so that it prefers the human choices directly?” The loss above does exactly this, weighting each gradient step by how strongly the current model still disagrees with the annotators. When the model already prefers the human answer, the gradients shrink; when it still prefers the rejected answer, the gradients grow.

DPO shares the KL safety tether with PPO—it still keeps the policy near the reference—but enjoys a simpler training loop. Chapter 12 compares DPO to other direct alignment algorithms like IPO and cDPO that add margins, calibration, or additional weighting to address preference displacement.

Offline algorithms shine when you have a strong preference dataset but limited rollout budget. Chapter 12 cautions that β must be tuned carefully to avoid overfitting and that high-quality references remain critical.
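
As a quick numerical sketch of this weighting (the β values and logit gaps below are arbitrary illustrations, not from the book), the per-comparison gradient weight σ(−βΔ) can be tabulated directly; larger β saturates the weight faster as the gap grows:

import torch

# Logit gaps relative to the reference: negative means the policy still
# prefers the rejected answer, positive means it already agrees with the annotator.
delta = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])

for beta in (0.1, 0.5):
    weight = torch.sigmoid(-beta * delta)  # per-comparison gradient weight
    print(f"beta={beta}: {[round(w, 3) for w in weight.tolist()]}")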

Analogy

Analogy: Debate Judge & Apprentice

Imagine a debate judge who already evaluated pairs of arguments. The apprentice rereads the transcripts and adjusts their wording to match the judge’s favourite picks. No new debates are hosted. DPO is that offline practice loop.

Debate judge

Keeps a transcript of human-preferred answers and nudges the apprentice to argue like the winner without ever training a separate reward model.

Apprentice debater

Adjusts wording directly to match the judge’s notes. DPO’s gradients push the apprentice to copy human choices while staying close to the reference model.

Visualization

DPO Playground

Experiment with the β temperature, optional margins, and logit gaps to see how DPO weights each comparison. Then inspect the loss curve compared to the reference policy.

DPO weighting playground

Adjust β, policy logits, and margin to see how DPO emphasises chosen vs. rejected samples without a reward model.


Interactive visualization

DPO loss curve

Inspect how the negative log-sigmoid term behaves as the policy moves away from the reference, echoing the derivation in Chapter 12.


Interactive visualization
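
The curve in the panel above can also be reproduced offline. A minimal sketch (the β value and grid of logit gaps are arbitrary) tabulates the per-example loss −log σ(βΔ) as the gap Δ moves away from zero:

import torch
import torch.nn.functional as F

beta = 0.1
delta = torch.linspace(-10.0, 10.0, steps=5)  # logit gap relative to the reference
loss = -F.logsigmoid(beta * delta)            # per-example DPO loss

for d, l in zip(delta.tolist(), loss.tolist()):
    print(f"delta={d:+6.1f}  loss={l:.3f}")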

Chapter 12 provides a concise implementation sketch. Below is a PyTorch-style training step that mirrors the book’s guidance: compute log-probs for the chosen and rejected completions, subtract the precomputed reference log-probs, and apply the log-sigmoid loss. (`model.log_probs` stands in for a helper that sums token log-probabilities over each completion.)

import torch
import torch.nn.functional as F

def dpo_step(model, optimizer, ref_logps, batch, beta=0.1):
    # Sequence log-probabilities of each completion under the current policy.
    logp_chosen = model.log_probs(batch['input_ids_chosen'], batch['attention_mask_chosen'])
    logp_rejected = model.log_probs(batch['input_ids_rejected'], batch['attention_mask_rejected'])

    # Log-probabilities under the frozen reference policy, precomputed offline.
    ref_chosen = ref_logps['chosen']
    ref_rejected = ref_logps['rejected']

    # Logit gap relative to the reference; beta * delta is the implicit reward margin.
    delta = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    loss = -F.logsigmoid(beta * delta).mean()

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
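
A hedged usage sketch follows: it assumes `ref_model` is a frozen copy of the starting checkpoint exposing the same `log_probs` helper and that `batch` comes from a preference dataloader; neither name is from the book. Reference log-probabilities are computed once under `torch.no_grad()` and reused.

# Usage sketch (assumed names: `ref_model` is a frozen copy of the initial policy,
# `batch` comes from a preference dataloader; the learning rate is illustrative).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)

with torch.no_grad():
    ref_logps = {
        'chosen': ref_model.log_probs(batch['input_ids_chosen'], batch['attention_mask_chosen']),
        'rejected': ref_model.log_probs(batch['input_ids_rejected'], batch['attention_mask_rejected']),
    }

loss = dpo_step(model, optimizer, ref_logps, batch, beta=0.1)
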
Takeaways

Operational Notes

  • DPO optimises policy log-probabilities directly, avoiding a reward model but retaining the KL tether to the reference.
  • β controls the trade-off between staying close to the reference and matching human preferences—large β can cause preference displacement.
  • Variants such as IPO, cDPO, and REBEL add margins or calibration to mitigate dataset biases (Chapter 12.2).
  • Offline processing reduces compute but depends on high-quality preference comparisons and good reference models.
  • Calibrated evaluation (RewardBench, AlpacaEval with length correction) remains necessary to catch displacement effects.
Self-check

DPO Check

Quiz yourself on DPO’s formulation, β tuning, and practical concerns from Chapter 12.


  1. Which expression matches the DPO objective from Chapter 12?
  2. What role does β play in DPO?
  3. What is preference displacement?
  4. How does DPO differ from PPO in data usage?
  5. Which variant extends DPO with calibration against a reward model?