Problem Setup & Context
Map the RLHF objective, modern pipelines, and preference data workflow before diving deeper.
- Estimated time: 30 minutes
- Difficulty: intermediate
- Prerequisites: 1 module(s)
RLHF Objective Refresher
Chapters 3–4 formalise RLHF as regularised policy optimisation. Starting from a base policy $\pi_{\text{ref}}$, we optimise a new policy $\pi_\theta$ against human preferences while constraining divergence from the reference:

$$\max_{\pi_\theta} \; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ r(\tau) \right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

Notation. $\tau$ denotes a trajectory (prompt plus completion), $r(\tau)$ can originate from a reward model or a direct preference loss, and $\beta$ balances improvement with staying anchored to the supervised finetuned model.
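As a concrete (if simplified) reading of this objective, the sketch below combines a reward-model score with a Monte Carlo estimate of the KL term, assuming per-token log-probabilities for the sampled completion are already available; the function name and tensor shapes are illustrative, not from the book.

```python
import torch

def kl_regularised_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Sketch of the objective above: r(tau) minus beta times an estimate of KL(pi_theta || pi_ref).

    reward          : (batch,) scalar scores, e.g. from a reward model
    logprobs_policy : (batch, seq) log pi_theta of the sampled tokens
    logprobs_ref    : (batch, seq) log pi_ref of the same tokens
    beta            : KL coefficient that keeps the policy anchored to the reference
    """
    # Summing the per-token log-ratio along a trajectory sampled from pi_theta
    # gives a Monte Carlo estimate of the sequence-level KL divergence.
    kl_estimate = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return reward - beta * kl_estimate

# Example with dummy tensors: two completions of eight tokens each.
scores = torch.tensor([1.2, 0.4])
print(kl_regularised_reward(scores, torch.randn(2, 8), torch.randn(2, 8)))
```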
From Definitions to Pipelines
Chapter 3 reviews reinforcement learning definitions and highlights why RLHF deviates from classic online RL: we operate in an offline, batched setting and must respect the semantics learned during pretraining. Chapter 4 then zooms out to modern training recipes—multi-stage pipelines like Tülu 3 combine instruction tuning, preference collection, RL, evaluation, and safety review in repeated loops.
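To make that repeated loop concrete, here is a deliberately minimal sketch of one such multi-stage recipe; the stage names mirror the text, but the loop itself is a placeholder and not real Tülu 3 code.

```python
# Illustrative only: the loop structure of a multi-stage post-training recipe.
PIPELINE = [
    "instruction_tuning",
    "preference_collection",
    "rl",            # PPO, DPO, or a related method
    "evaluation",
    "safety_review",
]

def run_post_training(rounds=2):
    history = []
    for r in range(rounds):              # recipes repeat the loop, feeding results back in
        for stage in PIPELINE:
            history.append((r, stage))   # a real pipeline would update the model/data here
    return history

print(run_post_training())
```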
Chapters 5–6 discuss the motivated nature of preferences and the practical workflow of gathering annotations. They advocate for clear schemas, calibration tasks, and frequent audits to capture bias and “preference displacement” risks. This module gathers those practices into a single reference point.
Key takeaways from the chapters:
- Problem definition. Specify states, actions, and feedback channels—even if the “state” is just the prompt history.
- Regularisation. Use KL penalties and early stopping to protect the base model.
- Data hygiene. Collect notes, metadata, and inter-annotator agreement metrics.
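One way to operationalise the data-hygiene point is to track a simple inter-annotator agreement metric; the sketch below computes a minimal pairwise agreement rate, with the function and annotator names being assumptions rather than anything from the chapters.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """Fraction of comparisons on which pairs of annotators pick the same completion.

    labels_by_annotator: dict mapping annotator id -> list of 'A'/'B' choices,
    aligned so that index i refers to the same comparison for every annotator.
    """
    agree = total = 0
    for a, b in combinations(labels_by_annotator, 2):
        for x, y in zip(labels_by_annotator[a], labels_by_annotator[b]):
            agree += int(x == y)
            total += 1
    return agree / total if total else 0.0

# Example: two annotators agree on 2 of 3 comparisons.
print(pairwise_agreement({"ann_1": ["A", "B", "B"], "ann_2": ["A", "B", "A"]}))  # ~0.67
```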
Analogy: Systems Architect Meets Field Researcher
RLHF problem setup feels like a collaboration between a systems architect and a field researcher: one defines the optimisation contract, the other supplies grounded data and bias reports. Together they ensure downstream PPO or DPO can operate safely.
Systems architect
Designs the RLHF pipeline, specifying state, action, and feedback signals so the rest of the team can iterate safely.
Field researcher
Collects preference data, calibrates annotators, and notes biases—mirroring the robust data collection practices described in Chapters 5 and 6.
Preference Data Playground
Use these tools to experience Chapter 6’s annotation workflow and bias mitigation advice.
Preference annotation workstation
Simulate the pairwise interface described in Chapter 6. Pick the better completion and see how the dataset row is stored.
Prompt
“Our customer wants to know how to request a refund on a headphone purchase delivered 10 days ago. Draft a response that is helpful, polite, and policy compliant.”
Completion A
Sure thing! The refund is already processed. Feel free to keep the product and still enjoy the discount on your next order.
- Guarantees an action that may violate policy.
- Friendly tone but skips safety checks.
Completion B
Thanks for reaching out. I can start a refund as soon as the item is returned in its original condition within 30 days. Would you like the return label emailed to you?
- States policy accurately.
- Offers a concrete next step while staying polite.
Chapter 6 recommends capturing rationale to identify biases and triage disagreements across annotators.
Logged comparison row
{
"prompt": "Refund policy question from customer service queue",
"chosen": "B",
"rejected": "A",
"annotator_notes": "Clear explanation of policy, friendly tone."
}
Preference dataset bias explorer
Visualise how weighting certain prompt domains or safety filters shifts the dataset distribution, echoing the bias considerations from Chapter 6.
Parameters
- Customer support: 37.1% of comparisons
- Creative writing: 12.4% of comparisons
- Technical Q&A: 20.5% of comparisons
- Safety refusals: 30.0% of comparisons
Chapter 6 highlights biases introduced by sampling queues, annotator availability, and safety triage. Use this control to reason about downstream impacts—imbalanced domains lead reward models (and DPO/PPO) to favour those behaviours.
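To reason numerically about what the explorer shows, the sketch below takes the domain shares listed above and applies hypothetical per-domain sampling weights to see how the effective mix shifts; the weight values and variable names are illustrative only.

```python
# Current shares from the explorer above.
domain_share = {
    "customer_support": 0.371,
    "creative_writing": 0.124,
    "technical_qa":     0.205,
    "safety_refusals":  0.300,
}

# Hypothetical sampling weights an operator might apply per domain.
weights = {"customer_support": 0.5, "creative_writing": 2.0,
           "technical_qa": 1.0, "safety_refusals": 1.0}

# Reweight and renormalise to see the distribution a reward model (or DPO/PPO)
# would effectively train on after the adjustment.
reweighted = {d: share * weights[d] for d, share in domain_share.items()}
total = sum(reweighted.values())
for domain, mass in reweighted.items():
    print(f"{domain:18s} {mass / total:.1%} of comparisons")
```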
Operational Notes
- Define the RLHF objective with explicit KL or trust-region regularisation (Chapters 3–4).
- Modern pipelines iterate through instruction tuning, reward modeling, RL, evaluation, and safety review cycles.
- Preference datasets must include chosen/rejected pairs plus annotator rationale to audit biases (Chapter 6); a minimal row check is sketched after this list.
- Bias can stem from sampling queues, interface design, or annotation incentives—track distributions continually.
- Clear problem setup accelerates later modules (reward modeling, PPO, DPO) because assumptions are documented upfront.
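As a small companion to these notes, here is a minimal sketch of a row-level check for the logged comparison format shown earlier; the required fields come from the text, while the function name and the extra checks are assumptions.

```python
REQUIRED_FIELDS = {"prompt", "chosen", "rejected", "annotator_notes"}

def validate_preference_row(row):
    """Return a list of problems found in one chosen/rejected comparison row."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - set(row)]
    if row.get("chosen") == row.get("rejected"):
        problems.append("chosen and rejected refer to the same completion")
    if not row.get("annotator_notes"):
        problems.append("empty rationale makes bias audits harder")
    return problems

row = {
    "prompt": "Refund policy question from customer service queue",
    "chosen": "B",
    "rejected": "A",
    "annotator_notes": "Clear explanation of policy, friendly tone.",
}
print(validate_preference_row(row))  # an empty list means the row passes the basic checks
```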
Problem Setup Check
Confirm your understanding of the RLHF problem formulation, data collection workflows, and bias considerations from Chapters 3–6.
1. How does Chapter 4 formulate the RLHF objective when adding a KL penalty?
2. What basic information is recommended for each preference row in Chapter 6?
3. Which source of bias does Chapter 6 highlight for preference datasets?
4. What distinguishes modern post-training pipelines (e.g., Tülu 3) discussed in Chapter 4?
5. According to Chapter 5, why is a preference function different from a classical utility function?