Advanced Concepts

Direct Preference Optimization (DPO)

Optimise policies directly against preference data without training a reward model.

Estimated time
30 minutes
Difficulty
intermediate
Prerequisites
2 module(s)
Equation

DPO Objective

Direct Preference Optimization (DPO) solves the RLHF objective without a reward model. Chapter 12 shows that the optimal policy of the KL-constrained objective is an exponential tilt of the reference policy by the reward; inverting that relationship expresses the reward in terms of the policy itself, so the policy can be fit directly on the human preference dataset.

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(\beta \big(\log \pi_\theta(y_c|x) - \log \pi_\theta(y_r|x) - \log \pi_{\text{ref}}(y_c|x) + \log \pi_{\text{ref}}(y_r|x)\big)\big)\Big]

DPO’s gradient (Equation 66 in the book) scales the gradient of the chosen-minus-rejected log-probability gap by σ(−βΔ), where Δ is the logit gap relative to the reference model: the weight grows while the policy still prefers the rejected answer and shrinks once it agrees with the annotator. No separate reward model is required.
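
Written out in the notation of the objective above, the gradient takes the standard DPO form (a reconstruction, not a verbatim copy of the book’s Equation 66):

\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\sigma(-\beta \Delta)\,\big(\nabla_\theta \log \pi_\theta(y_c|x) - \nabla_\theta \log \pi_\theta(y_r|x)\big)\Big], \qquad \Delta = \log \frac{\pi_\theta(y_c|x)}{\pi_{\text{ref}}(y_c|x)} - \log \frac{\pi_\theta(y_r|x)}{\pi_{\text{ref}}(y_r|x)}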

Intuition

Offline Alignment Intuition

Instead of training a reward model and running PPO, DPO asks: “Can we tilt the policy so that it prefers the human choices directly?” The loss above does exactly this, weighting each gradient step by how strongly the current model still disagrees with the annotators. When the model already prefers the human answer, the gradients shrink; when it still prefers the rejected answer, the gradients grow.

DPO shares the KL safety tether with PPO—it still keeps the policy near the reference—but enjoys a simpler training loop. Chapter 12 compares DPO to other direct alignment algorithms like IPO and cDPO that add margins, calibration, or additional weighting to address preference displacement.

Offline algorithms shine when you have a strong preference dataset but limited rollout budget. Chapter 12 cautions that β must be tuned carefully to avoid overfitting and that high-quality references remain critical.
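
As a quick numerical sketch of this weighting (the β values and logit gaps below are arbitrary illustrations, not from the book), the per-comparison gradient weight σ(−βΔ) can be tabulated directly; larger β saturates the weight faster as the gap grows:

import torch

# Logit gaps relative to the reference: negative means the policy still
# prefers the rejected answer, positive means it already agrees with the annotator.
delta = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])

for beta in (0.1, 0.5):
    weight = torch.sigmoid(-beta * delta)  # per-comparison gradient weight
    print(f"beta={beta}: {[round(w, 3) for w in weight.tolist()]}")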

Analogy

Analogy: Debate Judge & Apprentice

Imagine a debate judge who already evaluated pairs of arguments. The apprentice rereads the transcripts and adjusts their wording to match the judge’s favourite picks. No new debates are hosted. DPO is that offline practice loop.

Debate judge

Keeps a transcript of human-preferred answers and nudges the apprentice to argue like the winner without ever training a separate reward model.

Apprentice debater

Adjusts wording directly to match the judge’s notes. DPO’s gradients push the apprentice to copy human choices while staying close to the reference model.

Visualization

DPO Playground

Experiment with the β temperature, optional margins, and logit gaps to see how DPO weights each comparison. Then inspect the loss curve compared to the reference policy.

DPO weighting playground

Adjust β, policy logits, and margin to see how DPO emphasises chosen vs. rejected samples without a reward model.


Interactive visualization

DPO loss curve

Inspect how the negative log-sigmoid term behaves as the policy moves away from the reference, echoing the derivation in Chapter 12.


Interactive visualization
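
The curve in the panel above can also be reproduced offline. A minimal sketch (the β value and grid of logit gaps are arbitrary) tabulates the per-example loss −log σ(βΔ) as the gap Δ moves away from zero:

import torch
import torch.nn.functional as F

beta = 0.1
delta = torch.linspace(-10.0, 10.0, steps=5)  # logit gap relative to the reference
loss = -F.logsigmoid(beta * delta)            # per-example DPO loss

for d, l in zip(delta.tolist(), loss.tolist()):
    print(f"delta={d:+6.1f}  loss={l:.3f}")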

Chapter 12 provides a concise implementation sketch. Below is a PyTorch-style training step that mirrors the book’s guidance: compute log-probs for the chosen and rejected completions, subtract the precomputed reference log-probs, and apply the log-sigmoid loss. (`model.log_probs` stands in for a helper that sums token log-probabilities over each completion.)

import torch
import torch.nn.functional as F

def dpo_step(model, optimizer, ref_logps, batch, beta=0.1):
    # Sequence log-probabilities of each completion under the current policy.
    logp_chosen = model.log_probs(batch['input_ids_chosen'], batch['attention_mask_chosen'])
    logp_rejected = model.log_probs(batch['input_ids_rejected'], batch['attention_mask_rejected'])

    # Log-probabilities under the frozen reference policy, precomputed offline.
    ref_chosen = ref_logps['chosen']
    ref_rejected = ref_logps['rejected']

    # Logit gap relative to the reference; beta * delta is the implicit reward margin.
    delta = (logp_chosen - logp_rejected) - (ref_chosen - ref_rejected)
    loss = -F.logsigmoid(beta * delta).mean()

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
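
A hedged usage sketch follows: it assumes `ref_model` is a frozen copy of the starting checkpoint exposing the same `log_probs` helper and that `batch` comes from a preference dataloader; neither name is from the book. Reference log-probabilities are computed once under `torch.no_grad()` and reused.

# Usage sketch (assumed names: `ref_model` is a frozen copy of the initial policy,
# `batch` comes from a preference dataloader; the learning rate is illustrative).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)

with torch.no_grad():
    ref_logps = {
        'chosen': ref_model.log_probs(batch['input_ids_chosen'], batch['attention_mask_chosen']),
        'rejected': ref_model.log_probs(batch['input_ids_rejected'], batch['attention_mask_rejected']),
    }

loss = dpo_step(model, optimizer, ref_logps, batch, beta=0.1)
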
Takeaways

Operational Notes

  • DPO optimises policy log-probabilities directly, avoiding a reward model but retaining the KL tether to the reference.
  • β controls the trade-off between staying close to the reference and matching human preferences—large β can cause preference displacement.
  • Variants such as IPO, cDPO, and REBEL add margins or calibration to mitigate dataset biases (Chapter 12.2).
  • Offline processing reduces compute but depends on high-quality preference comparisons and good reference models.
  • Calibrated evaluation (RewardBench, AlpacaEval with length correction) remains necessary to catch displacement effects.
Self-check

DPO Check

Quiz yourself on DPO’s formulation, β tuning, and practical concerns from Chapter 12.


  1. Which expression matches the DPO objective from Chapter 12?
  2. What role does β play in DPO?
  3. What is preference displacement?
  4. How does DPO differ from PPO in data usage?
  5. Which variant extends DPO with calibration against a reward model?