Rejection Sampling
Filter sampled completions with a reward model and finetune on the winners as a lightweight RLHF baseline.
- Estimated time: 30 minutes
- Difficulty: Intermediate
- Prerequisites: 2 modules
Sampling & Selection
Chapter 10 frames rejection sampling as a simple pipeline: sample multiple completions for each prompt, evaluate them with a reward model, and keep only the top set before finetuning.
```python
def rejection_sampling(dataset, policy, reward_model, n_samples, top_k):
    # Sample n completions per prompt from the current policy (Chapter 10.1)
    sampled = []
    for prompt in dataset:
        completions = policy.generate(prompt, num_return_sequences=n_samples)
        # Score each completion with the reward model
        scored = [
            {
                "prompt": prompt,
                "completion": completion,
                "reward": reward_model.score(prompt, completion),
            }
            for completion in completions
        ]
        # Keep the top completions either per prompt or globally
        sampled.extend(select_top_k(scored, top_k))
    # Fine-tune on the filtered set just like supervised instruction tuning
    finetune_with_sft(sampled)
    return sampled
```
Selection can be per prompt or across the global pool. Chapter 10 recommends tracking how Top-K choices change when the reward model drifts.
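The chapter leaves `select_top_k` abstract; a minimal sketch of a per-prompt selector, plus a global-pool variant, might look like this (both names and signatures are assumptions, not Chapter 10 APIs):

```python
# Hedged sketch: neither function is an implementation from Chapter 10.
def select_top_k(scored, top_k):
    """Keep the top_k highest-reward completions for a single prompt."""
    return sorted(scored, key=lambda item: item["reward"], reverse=True)[:top_k]

def select_top_k_global(scored_pool, top_k):
    """Keep the top_k highest-reward completions across the whole pool,
    regardless of which prompt produced them."""
    return sorted(scored_pool, key=lambda item: item["reward"], reverse=True)[:top_k]
```

Per-prompt selection guarantees every prompt contributes at least one example; global selection concentrates on the highest rewards but can drop some prompts entirely.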
Why Rejection Sampling Works
Rejection sampling keeps the optimisation purely supervised while still benefiting from preference signals. By upgrading the instruction-tuning dataset with reward-filtered outputs, we gain higher-quality demonstrations without running PPO.
The trade-off is compute: more completions and stronger reward models produce better datasets but increase inference cost. Chapter 10 also advises monitoring KL divergence so that the policy finetuned on the filtered set does not drift far from the source policy.
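One way to monitor that drift is a Monte Carlo KL estimate over completions sampled from the finetuned policy; the sketch below assumes the caller supplies log-probability functions for both models (nothing here is a Chapter 10 API):

```python
# Hedged sketch of a Monte Carlo KL(finetuned || source) estimate.
def estimate_kl(logprob_new, logprob_ref, prompts, completions):
    """Average of log p_new(y|x) - log p_ref(y|x) over (prompt, completion) pairs.

    completions must be sampled from the *finetuned* policy for this estimator;
    logprob_new / logprob_ref map (prompt, completion) -> summed token log-prob.
    """
    gaps = [
        logprob_new(x, y) - logprob_ref(x, y)
        for x, y in zip(prompts, completions)
    ]
    return sum(gaps) / len(gaps)
```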
Practical workflow (Chapter 10): sample → score → select → finetune → evaluate. Rinse and repeat with updated policies or reward models.
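As a rough sketch of that loop (every callable below is a caller-supplied stand-in for one of the five steps, not a name from Chapter 10):

```python
# Hedged sketch of the sample -> score -> select -> finetune -> evaluate loop.
def run_rounds(base_policy, prompts, num_rounds, sample, score, select, finetune, evaluate):
    policy = base_policy
    for _ in range(num_rounds):
        completions = sample(policy, prompts)   # sample N completions per prompt
        scored = score(completions)             # attach reward-model scores
        filtered = select(scored)               # per-prompt or global top-k
        policy = finetune(policy, filtered)     # supervised finetuning on the winners
        evaluate(policy)                        # evaluate before the next round
    return policy
```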
Analogy: Editor's Slush Pile
Editors skim hundreds of drafts, keep a handful, and publish an anthology. Rejection sampling applies the same filtering to bootstrap better instruction-tuning data.
Editor with a slush pile
Requests many drafts (completions) and keeps only the best for publication, just like Chapter 10's baseline filtering.
Casting director
Auditions multiple actors for each role and selects the top fit. Rejection sampling performs the same filtering before supervised finetuning.
Sampling Lab
Try the playground to see how completions-per-prompt, temperature, and selection strategy influence the filtered dataset. Then compare the result against PPO and DPO in the snapshot below.
Rejection sampling playground
Simulate Chapter 10's baseline: generate N completions per prompt, score them with a reward model, then select the best to finetune.
| Prompt | Completion | Reward | Selected |
|---|---|---|---|
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.46 | |
| Summarise the RLHF training loop. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.97 | ✓ |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.77 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.53 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.86 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.83 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.74 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.62 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | ✓ |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.56 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.82 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.76 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.56 | |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.56 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.46 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.76 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.83 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.96 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.78 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.41 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.80 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.48 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.66 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.95 | ✓ |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.55 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.62 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.85 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.42 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.61 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.79 | |
Estimated fine-tuning reward ≈ 0.96 using 4 selected completions. Chapter 10 emphasises keeping enough diversity (temperature and N) while relying on the reward model to filter quality.
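For reference, the headline number appears to be the mean reward of the selected rows (this is an assumption about the playground's arithmetic, not a documented formula):

```python
# Hedged sketch: mean reward of the four ✓ rows in the table above.
selected_rewards = [0.97, 0.91, 1.00, 0.95]
estimated_reward = sum(selected_rewards) / len(selected_rewards)
print(f"Estimated fine-tuning reward ≈ {estimated_reward:.2f}")  # ≈ 0.96
```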
Method comparison snapshot
Compare rejection sampling to PPO and DPO qualitatively. Values are illustrative, matching the trade-offs described in Chapter 10.
| Method | Quality | Compute cost | Latency |
|---|---|---|---|
| Rejection Sampling | 76% | 53% | 42% |
| PPO | 86% | 75% | 60% |
| DPO | 78% | 45% | 40% |
Operational Notes
- Sample broadly (10–30 completions per prompt) and rely on a calibrated reward model to filter quality (Chapter 10.1).
- Track both per-prompt and global selection; they emphasise diversity differently.
- Record metadata (temperature, sampling policy, reward version) so future runs can audit changes.
- Best-of-N sampling applies the same selection logic at inference time; consider it when RL budgets are limited (see the sketch after this list).
- Rejection sampling is a baseline—once datasets plateau, switch to PPO or DPO for further gains.
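A minimal Best-of-N sketch, reusing the `policy.generate` / `reward_model.score` interfaces from the example above (the function itself is illustrative, not a Chapter 10 API):

```python
# Hedged sketch of Best-of-N at inference time: sample N candidates for a single
# prompt, score each with the reward model, and return only the highest-scoring one.
def best_of_n(prompt, policy, reward_model, n=16):
    candidates = policy.generate(prompt, num_return_sequences=n)
    return max(candidates, key=lambda c: reward_model.score(prompt, c))
```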
Rejection Sampling Check
Review the workflow, parameter trade-offs, and variants from Chapter 10.
1. How does Chapter 10 define rejection sampling for RLHF?
2. What are two selection schemes highlighted in Chapter 10?
3. Why is the completions-per-prompt budget important?
4. How does Best-of-N (BoN) relate to rejection sampling?
5. Which factor most strongly determines rejection sampling quality?