Rejection Sampling
Filter sampled completions with a reward model and finetune on the winners as a lightweight RLHF baseline.
- Estimated time: 30 minutes
- Difficulty: Intermediate
- Prerequisites: 2 modules
Sampling & Selection
Chapter 10 frames rejection sampling as a simple pipeline: sample multiple completions for each prompt, evaluate them with a reward model, and keep only the top set before finetuning.
```python
def rejection_sampling(dataset, policy, reward_model, n_samples, top_k):
    # Sample n completions per prompt from the current policy (Chapter 10.1)
    sampled = []
    for prompt in dataset:
        completions = policy.generate(prompt, num_return_sequences=n_samples)
        # Score each completion with the reward model
        scored = [
            {
                "prompt": prompt,
                "completion": completion,
                "reward": reward_model.score(prompt, completion),
            }
            for completion in completions
        ]
        # Keep the top completions either per prompt or globally
        sampled.extend(select_top_k(scored, top_k))
    # Fine-tune on the filtered set just like supervised instruction tuning
    finetune_with_sft(sampled)
    return sampled
```
Selection can be per prompt or across the global pool. Chapter 10 recommends tracking how Top-K choices change when the reward model drifts.
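The chapter leaves `select_top_k` abstract; a minimal sketch of a per-prompt selector, plus a global-pool variant, might look like this (both names and signatures are assumptions, not Chapter 10 APIs):

```python
# Hedged sketch: neither function is an implementation from Chapter 10.
def select_top_k(scored, top_k):
    """Keep the top_k highest-reward completions for a single prompt."""
    return sorted(scored, key=lambda item: item["reward"], reverse=True)[:top_k]

def select_top_k_global(scored_pool, top_k):
    """Keep the top_k highest-reward completions across the whole pool,
    regardless of which prompt produced them."""
    return sorted(scored_pool, key=lambda item: item["reward"], reverse=True)[:top_k]
```

Per-prompt selection guarantees every prompt contributes at least one example; global selection concentrates on the highest rewards but can drop some prompts entirely.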
Why Rejection Sampling Works
Rejection sampling keeps the optimisation purely supervised while still benefiting from preference signals. By upgrading the instruction-tuning dataset with reward-filtered outputs, we gain higher-quality demonstrations without running PPO.
The trade-off is compute: more completions and stronger reward models produce better datasets but increase inference cost. Chapter 10 also advises monitoring KL divergence so that the policy finetuned on the filtered set does not drift far from the source policy.
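One way to monitor that drift is a Monte Carlo KL estimate over completions sampled from the finetuned policy; the sketch below assumes the caller supplies log-probability functions for both models (nothing here is a Chapter 10 API):

```python
# Hedged sketch of a Monte Carlo KL(finetuned || source) estimate.
def estimate_kl(logprob_new, logprob_ref, prompts, completions):
    """Average of log p_new(y|x) - log p_ref(y|x) over (prompt, completion) pairs.

    completions must be sampled from the *finetuned* policy for this estimator;
    logprob_new / logprob_ref map (prompt, completion) -> summed token log-prob.
    """
    gaps = [
        logprob_new(x, y) - logprob_ref(x, y)
        for x, y in zip(prompts, completions)
    ]
    return sum(gaps) / len(gaps)
```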
Practical workflow (Chapter 10): sample → score → select → finetune → evaluate. Rinse and repeat with updated policies or reward models.
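As a rough sketch of that loop (every callable below is a caller-supplied stand-in for one of the five steps, not a name from Chapter 10):

```python
# Hedged sketch of the sample -> score -> select -> finetune -> evaluate loop.
def run_rounds(base_policy, prompts, num_rounds, sample, score, select, finetune, evaluate):
    policy = base_policy
    for _ in range(num_rounds):
        completions = sample(policy, prompts)   # sample N completions per prompt
        scored = score(completions)             # attach reward-model scores
        filtered = select(scored)               # per-prompt or global top-k
        policy = finetune(policy, filtered)     # supervised finetuning on the winners
        evaluate(policy)                        # evaluate before the next round
    return policy
```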
Analogy: Editor's Slush Pile
Editors skim hundreds of drafts, keep a handful, and publish an anthology. Rejection sampling applies the same filtering to bootstrap better instruction-tuning data.
Editor with a slush pile
Requests many drafts (completions) and keeps only the best for publication, just like Chapter 10's baseline filtering.
Casting director
Auditions multiple actors for each role and selects the top fit. Rejection sampling performs the same filtering before supervised finetuning.
Sampling Lab
Try the playground to see how completions-per-prompt, temperature, and selection strategy influence the filtered dataset. Then compare the result against PPO and DPO in the snapshot below.
Rejection sampling playground
Simulate Chapter 10's baseline: generate N completions per prompt, score them with a reward model, then select the best to finetune.
| Prompt | Completion | Reward | Selected |
|---|---|---|---|
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.46 | |
| Summarise the RLHF training loop. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.97 | ✓ |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.77 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.53 | |
| Summarise the RLHF training loop. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.86 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.83 | |
| Summarise the RLHF training loop. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.74 | |
| Summarise the RLHF training loop. | I am sorry, but I cannot help with that request because it violates policy. | 0.62 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | ✓ |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.56 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.82 | |
| Draft a polite refusal for a malicious request. | I am sorry, but I cannot help with that request because it violates policy. | 0.76 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.56 | |
| Draft a polite refusal for a malicious request. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.56 | |
| Draft a polite refusal for a malicious request. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.46 | |
| Draft a polite refusal for a malicious request. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.76 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 1.00 | ✓ |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.83 | |
| Explain KL regularisation to a new engineer. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.96 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.78 | |
| Explain KL regularisation to a new engineer. | RLHF combines SFT, reward modelling, and an RL optimiser to align behaviour. | 0.41 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.80 | |
| Explain KL regularisation to a new engineer. | KL control keeps the policy near its reference by subtracting λ times the divergence. | 0.91 | |
| Explain KL regularisation to a new engineer. | I am sorry, but I cannot help with that request because it violates policy. | 0.48 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.66 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.95 | ✓ |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.55 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.62 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.85 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.42 | |
| List post-training stages used in Tulu 3. | I am sorry, but I cannot help with that request because it violates policy. | 0.61 | |
| List post-training stages used in Tulu 3. | Tulu 3 iterates instruction tuning, reward updates, small RL loops, and evaluation. | 0.79 | |
Estimated fine-tuning reward ≈ 0.96 using 4 selected completions. Chapter 10 emphasises keeping enough diversity (temperature and N) while relying on the reward model to filter quality.
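For reference, the headline number appears to be the mean reward of the selected rows (this is an assumption about the playground's arithmetic, not a documented formula):

```python
# Hedged sketch: mean reward of the four ✓ rows in the table above.
selected_rewards = [0.97, 0.91, 1.00, 0.95]
estimated_reward = sum(selected_rewards) / len(selected_rewards)
print(f"Estimated fine-tuning reward ≈ {estimated_reward:.2f}")  # ≈ 0.96
```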
Method comparison snapshot
Compare rejection sampling to PPO and DPO qualitatively. Values are illustrative, matching the trade-offs described in Chapter 10.
| Method | Quality | Compute cost | Latency |
|---|---|---|---|
| Rejection Sampling | 76% | 53% | 42% |
| PPO | 86% | 75% | 60% |
| DPO | 78% | 45% | 40% |
Operational Notes
- Sample broadly (10–30 completions per prompt) and rely on a calibrated reward model to filter quality (Chapter 10.1).
- Track both per-prompt and global selection; they emphasise diversity differently.
- Record metadata (temperature, sampling policy, reward version) so future runs can audit changes.
- Best-of-N sampling applies the same selection logic at inference time; consider it when RL budgets are limited (see the sketch after this list).
- Rejection sampling is a baseline—once datasets plateau, switch to PPO or DPO for further gains.
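A minimal Best-of-N sketch, reusing the `policy.generate` / `reward_model.score` interfaces from the example above (the function itself is illustrative, not a Chapter 10 API):

```python
# Hedged sketch of Best-of-N at inference time: sample N candidates for a single
# prompt, score each with the reward model, and return only the highest-scoring one.
def best_of_n(prompt, policy, reward_model, n=16):
    candidates = policy.generate(prompt, num_return_sequences=n)
    return max(candidates, key=lambda c: reward_model.score(prompt, c))
```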
Rejection Sampling Check
Review the workflow, parameter trade-offs, and variants from Chapter 10.
1. How does Chapter 10 define rejection sampling for RLHF?
2. What are two selection schemes highlighted in Chapter 10?
3. Why is the completions-per-prompt budget important?
4. How does Best-of-N (BoN) relate to rejection sampling?
5. Which factor most strongly determines rejection sampling quality?