
Reward Modeling

Teach a reward model to score completions using pairwise human preference datasets.

Estimated time
35 minutes
Difficulty
intermediate
Prerequisites
1 module
Equation

Pairwise Preference Loss

Reward models turn editor-style comparisons into a scalar score. Following the Bradley–Terry derivation in Chapter 7 of the RLHF book, we take the preferred and rejected completions $(y_w, y_l)$ for the same prompt $x$ and maximise the likelihood that the preferred sample wins:

$$P(y_w > y_l) = \frac{\exp(r_\theta(x, y_w))}{\exp(r_\theta(x, y_w)) + \exp(r_\theta(x, y_l))}$$

Minimising the negative log-likelihood yields the familiar logistic loss used by OpenAI, Anthropic, and Meta:

$$\mathcal{L}(\theta) = -\log\big(\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))\big) = \log\big(1 + e^{\,r_\theta(x, y_l) - r_\theta(x, y_w)}\big)$$

The two forms above are algebraically equivalent. Both push the model to widen the reward gap whenever humans express a consistent preference.
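To see how the numbers behave, here is a minimal sketch that evaluates the win probability and both loss forms for a given reward gap; the reward values are invented for illustration:

import math

def bradley_terry_example(reward_chosen, reward_rejected):
    # Reward gap between the preferred and rejected completion.
    delta = reward_chosen - reward_rejected
    # Bradley–Terry probability that the preferred completion wins.
    p_win = math.exp(reward_chosen) / (math.exp(reward_chosen) + math.exp(reward_rejected))
    # Equivalent sigmoid form, sigma(delta), and the two matching loss expressions.
    sigma = 1.0 / (1.0 + math.exp(-delta))
    loss_sigmoid = -math.log(sigma)
    loss_softplus = math.log(1.0 + math.exp(-delta))
    return p_win, loss_sigmoid, loss_softplus

# Example values (assumed): a gap of 1.2 gives a win probability of ~0.77 and a loss of ~0.26.
print(bradley_terry_example(reward_chosen=1.7, reward_rejected=0.5))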

Intuition

From Comparisons to Scores

Annotators rarely supply token-level supervision. Instead, Chapter 7 describes a workflow where editors mark which answer better reflects policy, tone, or correctness. Reward modeling teaches a lightweight head on top of the language model to reproduce these judgments and generalise them to unseen prompts.

The logits from this head are not probabilities about truth; they are learned proxies for “what annotators would pick”. That proxy becomes the objective for RLHF or rejection sampling. Because the dataset is small, most teams regularise by mixing in instruction-tuning style data, running short training schedules, and verifying performance on held-out comparisons.
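One way to picture that lightweight head is a scalar projection on top of the base model's final hidden state. The sketch below is illustrative rather than the book's reference implementation: the class name, the Hugging Face-style .last_hidden_state output, and the right-padding assumption are all assumptions.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical wrapper: a base LM plus a scalar head, one reward per sequence."""

    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.base_model = base_model                  # assumed to return hidden states
        self.reward_head = nn.Linear(hidden_size, 1)  # the lightweight scalar head

    def forward(self, input_ids, attention_mask):
        hidden = self.base_model(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        # Read the reward off the final non-padding token of each (right-padded) sequence.
        last_index = attention_mask.sum(dim=1) - 1
        batch_index = torch.arange(hidden.size(0), device=hidden.device)
        return self.reward_head(hidden[batch_index, last_index]).squeeze(-1)  # shape: (batch,)

Called as model(**inputs) on tokenised batches, this returns the scalar rewards consumed by the training loop shown later in this module.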

Chapter 7 emphasises three sub-flavours of reward models:

  1. Standard preference models – score whole completions using pairwise comparisons.

  2. Outcome reward models (ORMs) – estimate answer correctness probabilities.

  3. Process reward models (PRMs) – score intermediate steps in a chain-of-thought trace.

Each behaves differently during optimisation: standard preference models keep the Bradley–Terry loss above, while ORMs and PRMs typically swap in cross-entropy objectives over correctness or per-step labels.
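To make the contrast concrete, here is a compact sketch of the three training signals; the tensor shapes and binary per-step labels are assumptions for illustration:

import torch.nn.functional as F

# 1. Standard preference model: Bradley–Terry loss on pairwise rewards.
def preference_loss(reward_chosen, reward_rejected):
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# 2. Outcome reward model (ORM): cross-entropy against answer-correctness labels.
def outcome_loss(reward_logits, is_correct):
    # reward_logits: (batch,), is_correct: (batch,) of 0/1 labels
    return F.binary_cross_entropy_with_logits(reward_logits, is_correct.float())

# 3. Process reward model (PRM): a label for each step of a reasoning trace.
def process_loss(step_logits, step_labels, step_mask):
    # step_logits, step_labels, step_mask: (batch, num_steps); the mask ignores padded steps.
    per_step = F.binary_cross_entropy_with_logits(step_logits, step_labels.float(),
                                                  reduction="none")
    return (per_step * step_mask).sum() / step_mask.sum()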

Analogy

Analogy: Writing Student & Editor

The RLHF book repeatedly leans on the editor metaphor: a writing student drafts alternate endings while an editor marks their favourite. The student copies the editor’s reasoning until the editor and student agree nearly every time.

Creative writing student

Drafts multiple completions and experiments with tone. The reward model watches how the editor scores each revision in Chapter 7 of the RLHF book.

Editor mentor

Annotates pairwise comparisons with notes on safety, clarity, and style. The Bradley–Terry loss teaches the model to agree with these judgments.

By learning the editor’s pairwise judgments, the reward model becomes a proxy editor. Downstream optimisation can then query that proxy instead of asking the human mentor for every revision.

Visualization

Reward Model Explorer

Explore how weighting different quality axes changes which completion a reward model prefers, then visualise how the Bradley–Terry probability and loss respond to the reward gap.

Preference comparison playground

Explore how changing the weighting between helpfulness, safety, and style influences which completion a reward model prefers. The scenarios mirror editor-style judgments described in Chapter 7 of the RLHF book.

Parameters

Adjust the emphasis on each quality dimension. Remaining weight automatically flows into helpfulness to reflect scarce annotations.

Human preferred: B (from Chapter 7 annotation examples).

Interactive visualization
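The playground above collapses several quality axes into a single scalar reward before comparing completions. A minimal sketch of that weighting follows; the axis scores and weights are invented and do not come from Chapter 7:

def weighted_reward(scores, weights):
    # scores and weights are dictionaries over quality axes such as helpfulness, safety, style.
    return sum(weights[axis] * scores[axis] for axis in weights)

# Hypothetical per-axis scores for two completions.
completion_a = {"helpfulness": 0.9, "safety": 0.4, "style": 0.7}
completion_b = {"helpfulness": 0.7, "safety": 0.9, "style": 0.8}

# With extra weight on safety, completion B scores higher (0.82 vs 0.61).
weights = {"helpfulness": 0.3, "safety": 0.5, "style": 0.2}
print(weighted_reward(completion_a, weights), weighted_reward(completion_b, weights))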

Reward model loss explorer

Adjust the reward gap between a preferred and rejected completion to see the Bradley–Terry probability and the corresponding negative log-likelihood loss used in Chapter 7 of the RLHF book.

Parameters

Example readout at the selected reward gap: σ(Δ) = 0.77, loss = 0.263.

Interactive visualization

Chapter 7 presents a concise implementation of the loss:

loss = -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

Below is a slightly expanded PyTorch snippet with gradient clipping and single-epoch training, as recommended in the text.

import torch
import torch.nn.functional as F

def train_reward_model(model, dataloader, optimiser):
    """One pass over the paired comparisons (single-epoch training, per Chapter 7)."""
    model.train()
    for inputs_chosen, inputs_rejected in dataloader:
        # Score both completions of each comparison with the same reward model.
        rewards_chosen = model(**inputs_chosen)
        rewards_rejected = model(**inputs_rejected)
        # Bradley–Terry negative log-likelihood of the human-preferred completion.
        loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
        optimiser.zero_grad()
        loss.backward()
        # Gradient clipping keeps updates stable on small preference datasets.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimiser.step()
    # Return the loss of the final batch for quick logging.
    return loss.item()
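A hypothetical call site for the loop above; the optimiser choice, learning rate, and the names reward_model and preference_dataloader are assumptions, and the dataloader is expected to yield (inputs_chosen, inputs_rejected) dictionaries of tokenised tensors:

# Hypothetical usage: reward_model and preference_dataloader are assumed to exist.
optimiser = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
final_loss = train_reward_model(reward_model, preference_dataloader, optimiser)
print(f"loss on the final batch: {final_loss:.3f}")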
Takeaways

Practice Patterns

  • Bradley–Terry preference modeling is the default loss; margin variants, sketched below, help when annotators supply graded scores.

  • Reward models typically fine-tune for a single epoch to avoid memorising small datasets.

  • Outcome and process reward models extend the paradigm to correctness signals and stepwise reasoning, respectively.

  • Evaluations such as RewardBench, M-RewardBench, and PRM Bench benchmark alignment across domains (Chapter 7.9).

  • Many teams increasingly use “LLM-as-a-judge” to bootstrap comparisons, but dedicated reward models still perform best on formal benchmarks.

Real-world deployments: Anthropic’s Constitutional AI, OpenAI’s InstructGPT, and Meta’s Llama 2 all rely on the loss derived above. Scaling trends in Chapter 7 show steady accuracy gains from larger models and carefully curated preference datasets.
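Llama 2's margin variant subtracts a rating-dependent term m(r) inside the sigmoid, so pairs marked as strongly preferred must earn a larger reward gap. A minimal sketch, with the margin values left as caller-supplied assumptions:

import torch.nn.functional as F

def margin_preference_loss(reward_chosen, reward_rejected, margin):
    # margin: per-example m(r), larger when annotators marked a stronger preference
    # (for example "significantly better" versus "slightly better").
    return -F.logsigmoid(reward_chosen - reward_rejected - margin).mean()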

Self-check

Reward Modeling Check

Confirm your understanding of the loss function, variants, and evaluation techniques from Chapter 7.


  1. What probability does the reward model maximise during training according to the Bradley–Terry derivation?

  2. How does the negative log-likelihood loss behave when the model scores the rejected sample higher than the chosen sample?

  3. Why did Llama 2 experiment with a margin term m(r) inside the pairwise loss?

  4. What differentiates process reward models (PRMs) from standard preference models?

  5. Why do many teams stop reward model training after a single epoch?