Direct Preference Optimization (DPO)
Optimise policies directly against preference data without training a reward model.
- Estimated time: 30 minutes
- Difficulty: intermediate
- Prerequisites: 2 modules
DPO Objective
Direct Preference Optimization (DPO) solves the RLHF objective without training a reward model. Chapter 12 shows that the optimal policy of the KL-constrained objective is an exponential tilt of the reference policy by the reward, which lets the implicit reward be rewritten in terms of the policy itself and fit directly on the human preference dataset.
DPO’s gradient (Equation 66 in the book) scales the difference between the chosen and rejected log-probability gradients by σ(−β·Δ), where Δ = [log πθ(y⁺|x) − log πθ(y⁻|x)] − [log πref(y⁺|x) − log πref(y⁻|x)] is the logit gap between the chosen (y⁺) and rejected (y⁻) completions relative to the reference model. No separate reward model is required.
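For reference, the loss and its gradient can be written in this notation (a standard statement of DPO; the gradient should correspond to the book's Equation 66 up to notation):

L_DPO(θ) = −E[ log σ(β·Δ) ]
∇θ L_DPO(θ) = −β · E[ σ(−β·Δ) · (∇θ log πθ(y⁺|x) − ∇θ log πθ(y⁻|x)) ]

with Δ defined as above and the expectation taken over preference triples (x, y⁺, y⁻) from the dataset.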
Offline Alignment Intuition
Instead of training a reward model and running PPO, DPO asks: “Can we tilt the policy so that it prefers the human choices directly?” The loss above does exactly this, weighting gradient steps by how much the current model already agrees with annotators. When the model already prefers the human answer, the gradients shrink; when it still prefers the rejected answer, the gradients grow.
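As a quick numerical sketch of that weighting (the values are illustrative; the weight σ(−β·Δ) comes from the gradient above):

import torch

beta = 0.1
# delta = (chosen - rejected) logit gap measured against the reference model.
deltas = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])
weights = torch.sigmoid(-beta * deltas)  # gradient weight per comparison
for d, w in zip(deltas.tolist(), weights.tolist()):
    print(f"delta={d:+.1f}  weight={w:.3f}")
# Negative gaps (the model still prefers the rejected answer) get weights above 0.5,
# so those comparisons dominate the update; positive gaps push the weight below 0.5
# and, for large beta * delta, toward zero.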
DPO shares the KL safety tether with PPO—it still keeps the policy near the reference—but enjoys a simpler training loop. Chapter 12 compares DPO to other direct alignment algorithms like IPO and cDPO that add margins, calibration, or additional weighting to address preference displacement.
Offline algorithms shine when you have a strong preference dataset but limited rollout budget. Chapter 12 cautions that β must be tuned carefully to avoid overfitting and that high-quality references remain critical.
Analogy: Debate Judge & Apprentice
Imagine a debate judge who already evaluated pairs of arguments. The apprentice rereads the transcripts and adjusts their wording to match the judge’s favourite picks. No new debates are hosted. DPO is that offline practice loop.
Debate judge
Keeps a transcript of human-preferred answers and nudges the apprentice to argue like the winner without ever training a separate reward model.
Apprentice debater
Adjusts wording directly to match the judge’s notes. DPO’s gradients push the apprentice to copy human choices while staying close to the reference model.
DPO Playground
Experiment with the β temperature, optional margins, and logit gaps to see how DPO weights each comparison. Then inspect how the loss curve behaves as the policy drifts away from the reference.
DPO weighting playground
Adjust β, policy logits, and margin to see how DPO emphasises chosen vs. rejected samples without a reward model.
Parameters
Chosen completion
Polite answer with concrete next steps.
πθ(y|x) logprob ≈ -0.60 vs. πref(y|x) ≈ -1.00
β · Δ = 0.04
Each card shows the per-completion gap β·Δ = β·(log πθ − log πref). DPO's gradient weight uses the pairwise gap, σ(−β·(Δchosen − Δrejected)), mirroring the weighting described in Chapter 12's derivative form (worked through numerically after the two cards).
Rejected completion
Dismissive answer that ignores the request.
πθ(y|x) logprob ≈ -1.40 vs. πref(y|x) ≈ -1.00
β · Δ = -0.04
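The card values can be checked by hand; a minimal sketch assuming β = 0.1 and the log-probs shown above:

import math

beta = 0.1
# Per-completion gaps versus the reference model, matching the two cards.
delta_chosen = -0.60 - (-1.00)    # 0.40  -> beta * delta = 0.04
delta_rejected = -1.40 - (-1.00)  # -0.40 -> beta * delta = -0.04
pair_gap = delta_chosen - delta_rejected           # 0.80
weight = 1.0 / (1.0 + math.exp(beta * pair_gap))   # sigma(-beta * gap) ~= 0.48
print(round(beta * delta_chosen, 2), round(beta * delta_rejected, 2), round(weight, 2))

A weight just under 0.5 means the policy already leans toward the chosen completion, so this comparison contributes a relatively small gradient step.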
DPO loss curve
Inspect how the negative log-sigmoid term behaves as the policy moves away from the reference, echoing the derivation in Chapter 12.
Parameters
Minimum loss ≈ 0.513 when the policy aligns with human preference. Large negative deltas (rewarding the rejected response) push the loss rapidly upward, mirroring the cautionary notes in Chapter 12 about preference displacement.
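A similar curve can be traced offline; a small sketch with illustrative β values and an assumed Δ range (the widget's exact slider settings may differ):

import torch
import torch.nn.functional as F

deltas = torch.linspace(-4.0, 4.0, 9)
for beta in (0.1, 0.5):
    losses = -F.logsigmoid(beta * deltas)  # per-pair DPO loss, -log sigma(beta * delta)
    print(f"beta={beta}:", [round(v, 3) for v in losses.tolist()])
# With beta = 0.1 the loss bottoms out near 0.513 at beta * delta = 0.4 (matching the
# minimum reported above) and grows roughly linearly for large negative gaps; a larger
# beta lowers the floor and steepens the climb.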
Chapter 12 provides a concise implementation sketch. Below is a PyTorch-style training step that mirrors the book’s guidance: compute log-probs for the chosen and rejected completions, subtract the reference log-probs, and apply the log-sigmoid loss.
import torch
import torch.nn.functional as F

def dpo_step(model, ref_logps, batch, beta=0.1):
    """One DPO update; the caller applies optimizer.step() / zero_grad()."""
    # Policy log-probs for each completion (assumes one attention mask per completion).
    logp_chosen = model.log_probs(batch['input_ids_chosen'], batch['attention_mask_chosen'])
    logp_rejected = model.log_probs(batch['input_ids_rejected'], batch['attention_mask_rejected'])
    # Precomputed reference-model log-probs; no gradient flows through these.
    ref_chosen = ref_logps['chosen']
    ref_rejected = ref_logps['rejected']
    # Logit gap relative to the reference: positive when the policy prefers the
    # chosen completion more strongly than the reference does.
    delta = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    # DPO loss: negative log-sigmoid of the scaled gap.
    loss = -F.logsigmoid(beta * delta).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    return loss.item()
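The model.log_probs helper above is not a built-in method of standard PyTorch or Hugging Face model classes; a minimal sketch of computing summed completion log-probabilities from causal-LM logits (the function name and masking convention here are assumptions) is:

import torch

def sequence_log_prob(logits, input_ids, completion_mask):
    # logits: (batch, seq, vocab) from a causal LM; input_ids: (batch, seq).
    # completion_mask: 1 on completion tokens, 0 on prompt and padding tokens.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = completion_mask[:, 1:].to(token_logprobs.dtype)
    return (token_logprobs * mask).sum(dim=-1)  # one summed log-prob per sequence

In dpo_step, both πθ and πref log-probs would be produced this way, with the reference values computed under torch.no_grad() or cached ahead of time, as the ref_logps argument assumes.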
Operational Notes
- DPO optimises policy log-probabilities directly, avoiding a reward model but retaining the KL tether to the reference.
- β controls the trade-off between staying close to the reference and matching human preferences—large β can cause preference displacement.
- Variants such as IPO, cDPO, and REBEL add margins or calibration to mitigate dataset biases (Chapter 12.2).
- Offline processing reduces compute but depends on high-quality preference comparisons and good reference models.
- Calibrated evaluation (RewardBench, AlpacaEval with length correction) remains necessary to catch displacement effects.
DPO Check
Quiz yourself on DPO’s formulation, β tuning, and practical concerns from Chapter 12.
- 1. Which expression matches the DPO objective from Chapter 12?
- 2. What role does β play in DPO?
- 3. What is preference displacement?
- 4. How does DPO differ from PPO in data usage?
- 5. Which variant extends DPO with calibration against a reward model?