
Rejection Sampling

Filter sampled completions with a reward model and finetune on the winners as a lightweight RLHF baseline.

Estimated time: 30 minutes
Difficulty: intermediate
Prerequisites: 2 module(s)
Equation

Sampling & Selection

Chapter 10 frames rejection sampling as a simple pipeline: sample multiple completions $y$ for each prompt, evaluate them with a reward model $\mathcal{R}$, and keep the top set before finetuning.

$$
\begin{aligned}
X &= [x_1, \dots, x_M] \\
Y_i &= [y_{i,1}, \dots, y_{i,N}], \quad y_{i,j} \sim \pi_{\text{model}}(\cdot \mid x_i) \\
R_{i,j} &= \mathcal{R}(y_{i,j} \mid x_i) \\
S(x_i) &= \text{TopK}(R_{i,1..N})
\end{aligned}
$$
def rejection_sampling(dataset, policy, reward_model, n_samples, top_k):
    # Sample n completions per prompt from the current policy (Chapter 10.1)
    sampled = []
    for prompt in dataset:
        completions = policy.generate(prompt, num_return_sequences=n_samples)
        # Score each completion with the reward model
        scored = [
            {
                "prompt": prompt,
                "completion": completion,
                "reward": reward_model.score(prompt, completion),
            }
            for completion in completions
        ]
        # Keep the top-k completions for this prompt (a global variant pools all prompts before selecting; see below)
        sampled.extend(select_top_k(scored, top_k))

    # Fine-tune on the filtered set just like supervised instruction tuning
    finetune_with_sft(sampled)
    return sampled

Selection can be per prompt or across the global pool. Chapter 10 recommends tracking how Top-K choices change when the reward model drifts.
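
A minimal sketch of a select_top_k helper covering both schemes is below; the per_prompt flag and the dict layout are assumptions carried over from the snippet above, not an interface defined in Chapter 10. In the loop above the helper sees one prompt's completions at a time, so per-prompt selection is implicit; global selection instead pools every scored completion and applies a single cutoff.

import heapq
from collections import defaultdict

def select_top_k(scored, top_k, per_prompt=True):
    # `scored` is a list of dicts with "prompt", "completion", "reward" keys,
    # as built in rejection_sampling above.
    if per_prompt:
        # Group by prompt and take the k highest-reward completions of each group
        by_prompt = defaultdict(list)
        for item in scored:
            by_prompt[item["prompt"]].append(item)
        selected = []
        for items in by_prompt.values():
            selected.extend(heapq.nlargest(top_k, items, key=lambda x: x["reward"]))
        return selected
    # Global selection: one pool, one cutoff, regardless of which prompt produced the completion
    return heapq.nlargest(top_k, scored, key=lambda x: x["reward"])

Per-prompt selection guarantees every prompt contributes examples; global selection can concentrate the dataset on prompts where the policy already scores well.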

Intuition

Why Rejection Sampling Works

Rejection sampling keeps the optimisation purely supervised while still benefiting from preference signals. By upgrading the instruction-tuning dataset with reward-filtered outputs, we gain higher-quality demonstrations without running PPO.

The trade-off is compute: sampling more completions and using stronger reward models produces better datasets but raises inference cost. Chapter 10 advises monitoring KL divergence so that the filtered set does not drift too far from the source policy.
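
One rough way to implement that monitoring, assuming each policy exposes a hypothetical logprob(prompt, completion) method, is to compare log-probabilities of the kept completions under the finetuned and source policies. Because the samples are not drawn from the finetuned policy this is only a proxy for the true KL, but it is cheap to track per round.

def estimate_policy_drift(filtered, policy_new, policy_ref):
    # Mean log-probability gap of the kept completions under the new vs. the
    # reference policy -- a crude stand-in for KL(policy_new || policy_ref).
    # `logprob(prompt, completion)` is an assumed interface, not from Chapter 10.
    gaps = [
        policy_new.logprob(ex["prompt"], ex["completion"])
        - policy_ref.logprob(ex["prompt"], ex["completion"])
        for ex in filtered
    ]
    return sum(gaps) / max(len(gaps), 1)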

Practical workflow (Chapter 10): sample → score → select → finetune → evaluate. Rinse and repeat with updated policies or reward models.
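
A minimal sketch of that loop, assuming finetune_with_sft (called inside rejection_sampling above) updates the policy in place, and with evaluate as a stand-in for whatever held-out evaluation the project uses:

def run_rounds(dataset, policy, reward_model, rounds=3, n_samples=16, top_k=4):
    history = []
    for round_idx in range(rounds):
        # sample -> score -> select -> finetune all happen inside rejection_sampling
        kept = rejection_sampling(dataset, policy, reward_model, n_samples, top_k)
        # evaluate before deciding whether another round is worth the compute
        history.append({"round": round_idx, "kept": len(kept), "eval": evaluate(policy)})
    return history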

Analogy

Analogy: Editor's Slush Pile

Editors skim hundreds of drafts, keep a handful, and publish an anthology. Rejection sampling performs the same step to bootstrap better instruction-tuning data.

Editor with a slush pile

Requests many drafts (completions) and keeps only the best for publication, just like Chapter 10's baseline filtering.

Casting director

Auditions multiple actors for each role and selects the top fit. Rejection sampling performs the same filtering before supervised finetuning.

Visualization

Sampling Lab

Try the playground to see how completions-per-prompt, temperature, and selection strategy influence the filtered dataset, then use the comparison snapshot to see how rejection sampling stacks up against PPO and DPO.

Rejection sampling playground

Simulate Chapter 10's baseline: generate N completions per prompt, score them with a reward model, then select the best to finetune.

Interactive visualization (parameters adjustable in the live playground)

Method comparison snapshot

Compare rejection sampling to PPO and DPO qualitatively. Values are illustrative, matching the trade-offs described in Chapter 10.

Interactive visualization (illustrative comparison values)

Takeaways

Operational Notes

  • Sample broadly (10–30 completions per prompt) and rely on a calibrated reward model to filter quality (Chapter 10.1).
  • Track both per-prompt and global selection; they emphasise diversity differently.
  • Record metadata (temperature, sampling policy, reward version) so future runs can audit changes.
  • Best-of-N sampling applies the same selection logic at inference time; consider it when RL budgets are limited (a sketch follows this list).
  • Rejection sampling is a baseline—once datasets plateau, switch to PPO or DPO for further gains.
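
A sketch of Best-of-N at inference time, reusing the generate/score interfaces from the snippet above; no finetuning is involved, so the selection cost recurs on every request:

def best_of_n(prompt, policy, reward_model, n=16):
    # Sample n candidates and return the single highest-reward completion
    candidates = policy.generate(prompt, num_return_sequences=n)
    return max(candidates, key=lambda c: reward_model.score(prompt, c))
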
Self-check

Rejection Sampling Check

Review the workflow, parameter trade-offs, and variants from Chapter 10.


  1. How does Chapter 10 define rejection sampling for RLHF?
  2. What are two selection schemes highlighted in Chapter 10?
  3. Why is the completions-per-prompt budget important?
  4. How does Best-of-N (BoN) relate to rejection sampling?
  5. Which factor most strongly determines rejection sampling quality?