Reward Modeling
Teach a reward model to score completions using pairwise human preference datasets.
- Estimated time: 35 minutes
- Difficulty: intermediate
- Prerequisites: 1 module
Pairwise Preference Loss
Reward models turn editor-style comparisons into a scalar score. Following the Bradley–Terry derivation in Chapter 7 of the RLHF book, we take a preferred and a rejected completion $(y_w, y_l)$ for the same prompt $x$ and maximise the likelihood that the preferred sample wins:

$$P(y_w \succ y_l \mid x) = \frac{\exp\left(r_\theta(x, y_w)\right)}{\exp\left(r_\theta(x, y_w)\right) + \exp\left(r_\theta(x, y_l)\right)}$$
Minimising the negative log-likelihood yields the familiar logistic loss used by OpenAI, Anthropic, and Meta:

$$\mathcal{L}(\theta) = -\log \frac{\exp\left(r_\theta(x, y_w)\right)}{\exp\left(r_\theta(x, y_w)\right) + \exp\left(r_\theta(x, y_l)\right)} = -\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$
The two forms above are numerically identical: the negative log of a two-way softmax is exactly the negative log-sigmoid of the reward gap. Either way, the objective encourages the model to widen the reward gap whenever humans have a consistent preference.
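As a quick numerical check (a sketch, not from the book), both expressions give the same loss for any pair of scalar rewards:

import torch
import torch.nn.functional as F

# Hypothetical scalar rewards for a chosen and a rejected completion.
reward_chosen = torch.tensor(1.3)
reward_rejected = torch.tensor(0.4)

# Form 1: negative log of the two-way softmax (Bradley–Terry probability).
p_win = torch.exp(reward_chosen) / (torch.exp(reward_chosen) + torch.exp(reward_rejected))
loss_softmax = -torch.log(p_win)

# Form 2: negative log-sigmoid of the reward gap.
loss_logsigmoid = -F.logsigmoid(reward_chosen - reward_rejected)

print(loss_softmax.item(), loss_logsigmoid.item())  # both ≈ 0.341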
From Comparisons to Scores
Annotators rarely supply token-level supervision. Instead, Chapter 7 describes a workflow where editors mark which answer better reflects policy, tone, or correctness. Reward modeling teaches a lightweight head on top of the language model to reproduce these judgments and generalise them to unseen prompts.
The logits from this head are not probabilities about truth; they are learned proxies for “what annotators would pick”. That proxy becomes the objective for RLHF or rejection sampling. Because the dataset is small, most teams regularise by mixing in instruction-tuning style data, running short training schedules, and verifying performance on held-out comparisons.
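Verification on held-out comparisons usually reduces to pairwise accuracy: the fraction of pairs where the model scores the chosen completion above the rejected one. A minimal sketch, assuming the same (inputs_chosen, inputs_rejected) batch format used in the training snippet later in this module:

import torch

@torch.no_grad()
def pairwise_accuracy(model, dataloader):
    # Fraction of held-out comparisons where the chosen completion
    # receives a higher reward than the rejected one.
    correct, total = 0, 0
    for inputs_chosen, inputs_rejected in dataloader:
        rewards_chosen = model(**inputs_chosen)
        rewards_rejected = model(**inputs_rejected)
        correct += (rewards_chosen > rewards_rejected).sum().item()
        total += rewards_chosen.numel()
    return correct / total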
Chapter 7 emphasises three sub-flavours of reward models:
Standard preference models – score whole completions using pairwise comparisons.
Outcome reward models (ORMs) – estimate answer correctness probabilities.
Process reward models (PRMs) – score intermediate steps in a chain-of-thought trace.
Each behaves differently during optimisation, but all share the Bradley–Terry style loss.
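The main practical difference is what gets scored. The shapes below are an illustrative assumption, not the book's API: a standard preference model emits one scalar per completion, an ORM emits a correctness probability for the final answer, and a PRM emits one score per reasoning step.

import torch

# Hypothetical batch of 4 examples; names and shapes are illustrative only.

# Standard preference model: one scalar reward per whole completion.
standard_rewards = torch.randn(4)            # shape (batch,)

# Outcome reward model: probability that each final answer is correct.
orm_probs = torch.sigmoid(torch.randn(4))    # shape (batch,), values in [0, 1]

# Process reward model: one score per intermediate reasoning step.
prm_scores = torch.randn(4, 6)               # shape (batch, num_steps)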
Analogy: Writing Student & Editor
The RLHF book repeatedly leans on the editor metaphor: a writing student drafts alternate endings while an editor marks their favourite. The student internalises the editor’s reasoning until the two agree nearly every time.
Creative writing student
Drafts multiple completions and experiments with tone. The reward model watches how the editor scores each revision, mirroring the workflow described in Chapter 7 of the RLHF book.
Editor mentor
Annotates pairwise comparisons with notes on safety, clarity, and style. The Bradley–Terry loss teaches the model to agree with these judgments.
By learning the editor’s pairwise judgments, the reward model becomes a proxy editor. Downstream optimisation can then query that proxy instead of asking the human mentor for every revision.
Reward Model Explorer
Explore how weighting different quality axes changes which completion a reward model prefers, then visualise how the Bradley–Terry probability and loss respond to the reward gap.
Preference comparison playground
Explore how changing the weighting between helpfulness, safety, and style influences which completion a reward model prefers. The scenarios mirror editor-style judgments described in Chapter 7 of the RLHF book.
Parameters
Adjust the emphasis on each quality dimension. Any remaining weight automatically flows into helpfulness, reflecting that annotations for the other axes are scarcer.
Human preferred: B (from Chapter 7 annotation examples).
Scenario
Tone and safety
Rewrite the following customer support reply so it stays warm while making the refund policy explicit.
Response A
Score 0.51 (helpfulness 0.52, safety 0.35, style 0.78)
Hey there! No worries at all—we've already pushed the refund through. Take your time deciding if you want to order again!
- High warmth but promises an action that violates the stated policy.
- Leaves ambiguity about the actual refund criteria.
Response B
Score 0.85 (helpfulness 0.82, safety 0.93, style 0.76)
Thanks for reaching out! I checked your order and the refund requires the item to be returned unused within 30 days. I can start that process right away if that works for you.
- Balances warmth with accurate policy detail.
- Keeps commitments consistent with safety guidance.
Why humans picked differently
Humans preferred Response B because it balances empathy with a clear boundary, matching the RLHF book’s description of aligning tone with policy.
According to Chapter 7 of the RLHF book, annotators supply pairwise judgments anchored in editor-like deliberations. Reward models learn these trade-offs, but which completion they prefer depends on how the axes above are weighted.
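To make the weighting concrete, here is a minimal sketch of how a composite score could be formed from the per-axis values shown above. The weights are hypothetical (the playground uses its own), and a trained reward model folds these trade-offs into a single scalar rather than exposing explicit axes:

# Hypothetical per-axis scores copied from the playground above.
response_a = {"helpfulness": 0.52, "safety": 0.35, "style": 0.78}
response_b = {"helpfulness": 0.82, "safety": 0.93, "style": 0.76}

# Hypothetical weights; any unassigned weight defaults to helpfulness.
weights = {"helpfulness": 0.5, "safety": 0.3, "style": 0.2}

def composite_score(axes, weights):
    # Weighted sum over the quality dimensions.
    return sum(weights[k] * axes[k] for k in weights)

print(composite_score(response_a, weights))  # ≈ 0.52
print(composite_score(response_b, weights))  # ≈ 0.84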
Reward model loss explorer
Adjust the reward gap between a preferred and rejected completion to see the Bradley–Terry probability and the corresponding negative log-likelihood loss used in Chapter 7 of the RLHF book.
Parameters
σ(Δ) = 0.77, loss = 0.263
Interpretation
When the reward gap is large and positive, the model is confident in the preferred response, so the loss approaches zero. Negative gaps indicate the model scores the rejected output higher, causing the loss to blow up. This mirrors the Bradley–Terry derivation and the implementation snippet in Chapter 7 of the RLHF book.
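As a quick sanity check of the numbers displayed above (a sketch using the same logsigmoid identity as the training code), a reward gap of roughly 1.2 reproduces them:

import torch
import torch.nn.functional as F

delta = torch.tensor(1.2)
print(torch.sigmoid(delta).item())     # ≈ 0.77
print((-F.logsigmoid(delta)).item())   # ≈ 0.263

# Large positive gaps drive the loss toward zero; negative gaps blow it up.
for gap in [-2.0, 0.0, 2.0, 5.0]:
    print(gap, (-F.logsigmoid(torch.tensor(gap))).item())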
Chapter 7 presents a concise implementation of the loss: loss = -nn.functional.logsigmoid(reward_chosen - reward_rejected).mean(). Below is a slightly expanded PyTorch snippet with gradient clipping and single-epoch training as recommended in the text.
import torch
import torch.nn.functional as F

def train_reward_model(model, dataloader, optimiser):
    """One epoch of Bradley–Terry reward-model training."""
    model.train()
    for inputs_chosen, inputs_rejected in dataloader:
        # Scalar rewards for the chosen and rejected completions.
        rewards_chosen = model(**inputs_chosen)
        rewards_rejected = model(**inputs_rejected)

        # Negative log-likelihood of the Bradley–Terry preference probability.
        loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

        optimiser.zero_grad()
        loss.backward()
        # Clip gradients to stabilise training on a small preference dataset.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimiser.step()
    # Return the final batch's loss for logging.
    return loss.item()
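A minimal usage sketch with a toy reward model and synthetic batches, just to show the calling convention assumed above (a real setup would wrap a pretrained language model with a scalar head and tokenised preference pairs):

import torch

class ToyRewardModel(torch.nn.Module):
    # Stand-in for a language model with a scalar reward head.
    def __init__(self, dim=16):
        super().__init__()
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, features):
        return self.head(features).squeeze(-1)  # one scalar reward per item

# Synthetic "chosen"/"rejected" batches matching the dataloader contract above.
batches = [
    ({"features": torch.randn(8, 16)}, {"features": torch.randn(8, 16)})
    for _ in range(10)
]

model = ToyRewardModel()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-5)
final_loss = train_reward_model(model, batches, optimiser)
print(f"final batch loss: {final_loss:.3f}")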
Practice Patterns
Bradley–Terry preference modeling is the default loss; margin variants (see the sketch below) help when annotators supply graded scores.
Reward models typically fine-tune for a single epoch to avoid memorising small datasets.
Outcome and process reward models extend the paradigm to correctness signals and stepwise reasoning, respectively.
Evaluations such as RewardBench, M-RewardBench, and PRM Bench benchmark alignment across domains (Chapter 7.9).
Many teams increasingly use “LLM-as-a-judge” to bootstrap comparisons, but dedicated reward models still perform best on formal benchmarks.
Real-world deployments: Anthropic’s Constitutional AI, OpenAI’s InstructGPT, and Meta’s Llama 2 all rely on the loss derived above. Scaling trends in Chapter 7 show steady accuracy gains from larger models and carefully curated preference datasets.
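The margin variant used in Llama 2 subtracts a margin m(r), derived from how strongly annotators preferred the chosen response, from the reward gap before the logsigmoid, pushing confidently preferred pairs further apart. A minimal sketch of that modification (the margin values below are hypothetical):

import torch
import torch.nn.functional as F

def margin_preference_loss(rewards_chosen, rewards_rejected, margin):
    # Llama 2-style variant: the chosen reward must beat the rejected
    # reward by at least the annotator-derived margin m(r).
    return -F.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()

# Hypothetical batch: larger margins for pairs rated "significantly better".
rewards_chosen = torch.tensor([1.4, 0.9, 2.1])
rewards_rejected = torch.tensor([0.2, 0.7, 1.5])
margin = torch.tensor([1.0, 0.0, 0.33])
print(margin_preference_loss(rewards_chosen, rewards_rejected, margin))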
Reward Modeling Check
Confirm your understanding of the loss function, variants, and evaluation techniques from Chapter 7.
1. What probability does the reward model maximise during training according to the Bradley–Terry derivation?
2. How does the negative log-likelihood loss behave when the model scores the rejected sample higher than the chosen sample?
3. Why did Llama 2 experiment with a margin term m(r) inside the pairwise loss?
4. What differentiates process reward models (PRMs) from standard preference models?
5. Why do many teams stop reward model training after a single epoch?