Enhanced Features - Chapter 10
Concept Playground
Experiment with RLHF training strategies using the pre-built scenarios derived from Chapters 10–12 of the RLHF book. Adjust parameters, compare methods, and record runs for a lightweight session log.
How to use
- Select a scenario tab to load its interactive simulation.
- Follow the guided experiment steps and observe the expected signals.
- Click Record run to log a configuration for later comparison.
Rejection sampling playground
Simulate Chapter 10's baseline: generate N completions per prompt, score each with a reward model, and keep the highest-reward completion per prompt as finetuning data.
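The selection rule here is plain best-of-N. Below is a minimal Python sketch of that loop; `generate` and `reward_model` are hypothetical stand-ins for the policy and the learned reward model, and the toy demo at the bottom exists only to make the sketch runnable.

```python
def best_of_n(prompt, generate, reward_model, n=8, temperature=1.0):
    """Sample n completions for one prompt, score each with the reward
    model, and keep the highest-scoring completion as finetuning data."""
    completions = [generate(prompt, temperature=temperature) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in completions]
    best = max(range(n), key=scores.__getitem__)
    return completions[best], scores[best]


if __name__ == "__main__":
    import random

    # Toy stand-ins so the sketch runs end to end; a real setup would call
    # a policy model and a learned reward model instead.
    canned = ["refusal", "KL control summary", "Tulu 3 stages", "RLHF overview"]
    generate = lambda prompt, temperature=1.0: random.choice(canned)
    reward_model = lambda prompt, completion: round(random.uniform(0.4, 1.0), 2)

    print(best_of_n("Summarise the RLHF training loop.", generate, reward_model))
```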
Parameters: completions per prompt N (8 in this run) and sampling temperature.
| Prompt | Completion | Reward | Selected |
|---|---|---|---|
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.57 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.57 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.79 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.81 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.58 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.99 | ✓ |
| Summarise the RLHF training loop. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.92 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.54 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.94 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.47 | |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.68 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.95 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.74 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.69 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.75 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.58 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.96 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.61 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.53 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.89 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.55 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.93 | |
| List post-training stages used in Tulu 3. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.69 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.59 | |
| List post-training stages used in Tulu 3. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.83 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.83 | |
| List post-training stages used in Tulu 3. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.97 | ✓ |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.42 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.91 | |
Estimated fine-tuning reward ≈ 0.99, i.e. the mean reward of the 4 selected completions. Chapter 10 emphasises keeping enough diversity (temperature and N) while relying on the reward model to filter quality.
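As a sanity check, the figure can be reproduced from the table above, assuming the estimate is simply the mean of the selected rewards (which matches the numbers shown):

```python
# Rewards of the four selected (✓) completions from the table above.
selected_rewards = [0.99, 1.00, 1.00, 0.97]

estimate = sum(selected_rewards) / len(selected_rewards)
print(f"Estimated fine-tuning reward ≈ {estimate:.2f}")  # 0.99
```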
Method comparison snapshot
Side-by-side summary of the most recent reading for each method. Values are normalised so quality, cost, and stability trade-offs can be compared at a glance.
| Method | Quality proxy | Cost proxy | Stability proxy | Last note |
|---|---|---|---|---|
| Rejection Sampling Baseline (Chapter 10) | — | — | — | Interact with the scenario to populate metrics. |
| PPO Policy Update (Chapter 11) | — | — | — | Interact with the scenario to populate metrics. |
| DPO Weighting (Chapter 12) | — | — | — | Interact with the scenario to populate metrics. |
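The page does not state how the proxies are normalised; a minimal sketch, assuming plain min-max scaling of each raw reading to [0, 1] across methods, with made-up placeholder values:

```python
def min_max(raw: dict) -> dict:
    """Rescale raw per-method readings to [0, 1] for quick comparison."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {k: 0.5 for k in raw}  # all methods tied; place them mid-scale
    return {k: (v - lo) / (hi - lo) for k, v in raw.items()}


# Hypothetical raw quality readings, one per method.
quality = {"rejection_sampling": 0.72, "ppo": 0.81, "dpo": 0.78}
print(min_max(quality))
```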
Session log
A lightweight record of the runs you captured this session (clears on refresh).
No runs captured yet.
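The exact shape of a recorded run is not shown here; one plausible in-memory sketch (all field names are assumptions), consistent with a log that clears on refresh:

```python
from dataclasses import dataclass, field
from time import time


@dataclass
class RecordedRun:
    """One captured configuration, kept only in memory for the session."""
    scenario: str     # e.g. "Rejection Sampling Baseline"
    parameters: dict  # e.g. {"n": 8, "temperature": 1.0}
    metrics: dict     # e.g. {"estimated_reward": 0.99}
    timestamp: float = field(default_factory=time)


session_log = []  # cleared whenever the page state resets
session_log.append(RecordedRun(
    scenario="Rejection Sampling Baseline",
    parameters={"n": 8, "temperature": 1.0},
    metrics={"estimated_reward": 0.99},
))
print(session_log)
```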
Performance summary
Aggregated signals from this session, per scenario. Use it as a quick retrospective before you move on.
Rejection Sampling Baseline
No runs recorded yet.
PPO Policy Update
No runs recorded yet.
DPO Weighting
No runs recorded yet.