Enhanced Features - Chapter 10
Concept Playground
Experiment with RLHF training strategies using the pre-built scenarios derived from Chapters 10–12 of the RLHF book. Adjust parameters, compare methods, and record runs for a lightweight session log.
How to use
- Select a scenario tab to load its interactive simulation.
- Follow the guided experiment steps and observe the expected signals.
- Click Record run to log a configuration for later comparison.
Rejection sampling playground
Simulate Chapter 10's baseline: generate N completions per prompt, score each with a reward model, and keep the highest-reward completion per prompt as finetuning data.
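The selection rule here is plain best-of-N. Below is a minimal Python sketch of that loop; `generate` and `reward_model` are hypothetical stand-ins for the policy and the learned reward model, and the toy demo at the bottom exists only to make the sketch runnable.

```python
def best_of_n(prompt, generate, reward_model, n=8, temperature=1.0):
    """Sample n completions for one prompt, score each with the reward
    model, and keep the highest-scoring completion as finetuning data."""
    completions = [generate(prompt, temperature=temperature) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in completions]
    best = max(range(n), key=scores.__getitem__)
    return completions[best], scores[best]


if __name__ == "__main__":
    import random

    # Toy stand-ins so the sketch runs end to end; a real setup would call
    # a policy model and a learned reward model instead.
    canned = ["refusal", "KL control summary", "Tulu 3 stages", "RLHF overview"]
    generate = lambda prompt, temperature=1.0: random.choice(canned)
    reward_model = lambda prompt, completion: round(random.uniform(0.4, 1.0), 2)

    print(best_of_n("Summarise the RLHF training loop.", generate, reward_model))
```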
Parameters: completions per prompt N (8 in this run) and sampling temperature.
| Prompt | Completion | Reward | Selected |
|---|---|---|---|
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.57 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.57 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.79 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.81 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.58 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.99 | ✓ |
| Summarise the RLHF training loop. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.92 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.54 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.94 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.47 | |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.68 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.95 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.74 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.69 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.75 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.58 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.96 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.61 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.53 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.89 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.55 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.93 | |
| List post-training stages used in Tulu 3. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.69 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.59 | |
| List post-training stages used in Tulu 3. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.83 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.83 | |
| List post-training stages used in Tulu 3. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.97 | ✓ |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.42 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.91 | |
Estimated fine-tuning reward ≈ 0.99, i.e. the mean reward of the 4 selected completions. Chapter 10 emphasises keeping enough diversity (temperature and N) while relying on the reward model to filter quality.
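As a sanity check, the figure can be reproduced from the table above, assuming the estimate is simply the mean of the selected rewards (which matches the numbers shown):

```python
# Rewards of the four selected (✓) completions from the table above.
selected_rewards = [0.99, 1.00, 1.00, 0.97]

estimate = sum(selected_rewards) / len(selected_rewards)
print(f"Estimated fine-tuning reward ≈ {estimate:.2f}")  # 0.99
```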
Method comparison snapshot
Side-by-side summary of the most recent reading for each method. Values are normalised so quality, cost, and stability trade-offs can be compared at a glance.
| Method | Quality proxy | Cost proxy | Stability proxy | Last note |
|---|---|---|---|---|
| Rejection Sampling Baseline (Chapter 10) | — | — | — | Interact with the scenario to populate metrics. |
| PPO Policy Update (Chapter 11) | — | — | — | Interact with the scenario to populate metrics. |
| DPO Weighting (Chapter 12) | — | — | — | Interact with the scenario to populate metrics. |
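The page does not state how the proxies are normalised; a minimal sketch, assuming plain min-max scaling of each raw reading to [0, 1] across methods, with made-up placeholder values:

```python
def min_max(raw: dict) -> dict:
    """Rescale raw per-method readings to [0, 1] for quick comparison."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        return {k: 0.5 for k in raw}  # all methods tied; place them mid-scale
    return {k: (v - lo) / (hi - lo) for k, v in raw.items()}


# Hypothetical raw quality readings, one per method.
quality = {"rejection_sampling": 0.72, "ppo": 0.81, "dpo": 0.78}
print(min_max(quality))
```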
Session log
A lightweight record of the runs you captured this session (clears on refresh).
No runs captured yet.
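The exact shape of a recorded run is not shown here; one plausible in-memory sketch (all field names are assumptions), consistent with a log that clears on refresh:

```python
from dataclasses import dataclass, field
from time import time


@dataclass
class RecordedRun:
    """One captured configuration, kept only in memory for the session."""
    scenario: str     # e.g. "Rejection Sampling Baseline"
    parameters: dict  # e.g. {"n": 8, "temperature": 1.0}
    metrics: dict     # e.g. {"estimated_reward": 0.99}
    timestamp: float = field(default_factory=time)


session_log = []  # cleared whenever the page state resets
session_log.append(RecordedRun(
    scenario="Rejection Sampling Baseline",
    parameters={"n": 8, "temperature": 1.0},
    metrics={"estimated_reward": 0.99},
))
print(session_log)
```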
Performance summary
Aggregated signals from this session, per scenario. Use it as a quick retrospective before you move on.
Rejection Sampling Baseline
No runs recorded yet.
PPO Policy Update
No runs recorded yet.
DPO Weighting
No runs recorded yet.