Advanced Concepts

Reasoning Training & Inference Scaling

Walk through RL with verifiable rewards, chain-of-thought refinement, and inference-time scaling techniques used by o1, DeepSeek R1, and Tulu 3.

Estimated time
40 minutes
Difficulty
advanced
Prerequisites
2 modules
Equation

RLVR Objective

Chapter 14 frames RL with Verifiable Rewards (RLVR) as an extension of the RLHF objective where reward signals come from automated checkers instead of human ratings. For a verifier V(x, y) ∈ {0, 1}, the optimisation step looks like PPO or policy gradient with binary rewards.

\begin{aligned}
&\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\big[\, V(x, y) \cdot A_{\theta}(x, y) \,\big] \\
&\text{where } A_{\theta}(x, y) = V(x, y) - b(x) \text{ for a prompt-level baseline } b(x), \text{ or comes from a critic trained on verifier outcomes.}
\end{aligned}

When the verifier returns structured feedback (unit tests, proof traces), teams often log extra metadata so they can reuse successful trajectories for distillation.
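To make the update concrete, here is a minimal Python sketch of one RLVR step under the simplest choice of baseline: score a group of sampled answers with the 0/1 verifier, take the group-mean reward as b(x), and weight each answer's log-probability by the resulting advantage. The names `verifier`, `answers`, and `logprob_sums` are illustrative placeholders, not APIs from any particular framework.

```python
# Minimal sketch of one RLVR update step: score K sampled answers with a
# 0/1 verifier, use the group-mean reward as the baseline b(x), and weight
# each answer's log-probability by the resulting advantage.
# `verifier`, `answers`, and `logprob_sums` are illustrative placeholders.
from statistics import mean

def rlvr_advantages(prompt, answers, verifier):
    """A(x, y) = V(x, y) - b(x), with b(x) = mean verifier reward over the group."""
    rewards = [float(verifier(prompt, a)) for a in answers]  # V(x, y) in {0, 1}
    baseline = mean(rewards)                                 # b(x)
    return [r - baseline for r in rewards]

def policy_gradient_loss(logprob_sums, advantages):
    """REINFORCE-style surrogate: -(1/K) * sum_k A_k * log pi(y_k | x).
    With autograd tensors for logprob_sums, minimising this pushes
    probability mass toward verified answers."""
    k = len(advantages)
    return -sum(a * lp for a, lp in zip(advantages, logprob_sums)) / k
```

Using the group mean as the baseline is only one option; as the equation above notes, a learned critic trained on verifier outcomes can supply the advantage instead.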

Intuition

From STaR to RLVR

Earlier work such as STaR and TRICE (2022-2023) approximated the policy gradient on math problems by filtering sampled traces and applying cross-entropy updates to the ones that passed. Chapter 14 explains how modern systems such as Tulu 3, VinePPO, and DeepSeek R1 build on these ideas with binary rewards from execution engines and proof checkers.
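A rough sketch of that filtering loop, assuming you supply your own `generate`, `is_correct`, and `finetune` callables (all hypothetical names): sample a few rationales per problem, keep only the ones the checker accepts, and train on the kept traces with ordinary cross-entropy.

```python
# STaR-style iteration sketch: sample rationales, keep only the ones the
# checker accepts, and fine-tune on them with ordinary cross-entropy.
# `generate`, `is_correct`, and `finetune` stand in for your own sampling,
# verification, and SFT code.
def star_round(problems, generate, is_correct, finetune, samples_per_problem=4):
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = generate(problem)        # rationale + final answer
            if is_correct(problem, trace):   # binary filter, no reward model
                kept.append((problem, trace))
                break                        # one verified trace is enough
    finetune(kept)                           # cross-entropy on filtered traces
    return kept
```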

Reinforcement finetuning lets the model practise a narrow set of prompts hundreds of times, reinforcing correct answers instead of simply imitating reference text. The result is a tutor-like model that plans, verifies, and revises before committing to an answer. Distillation then compresses that behaviour into smaller student models that are cheaper to serve.

Measuring progress means tracking verifier pass rate, reward deltas, and inference tokens per answer. Chapter 14 links all three to the surge of reasoning models like o1, R1, and Tulu 3.
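As a sketch of what that tracking might look like, assuming each logged record carries a verified flag and a token count (the field names here are illustrative, not from any specific logging schema):

```python
# Sketch of the three tracking metrics: verifier pass rate, reward delta
# versus a reference run, and inference tokens per answer.
# Field names ('verified', 'num_tokens') are illustrative.
def reasoning_metrics(records, baseline_pass_rate):
    """records: iterable of dicts with 'verified' (bool) and 'num_tokens' (int)."""
    records = list(records)
    pass_rate = sum(r["verified"] for r in records) / len(records)
    return {
        "verifier_pass_rate": pass_rate,
        "reward_delta": pass_rate - baseline_pass_rate,  # gain over reference run
        "tokens_per_answer": sum(r["num_tokens"] for r in records) / len(records),
    }
```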

Analogy

Analogy: Math Tutor Studio

Imagine a classroom where every attempt at a proof is checked instantly. Students revise until the grader signs off, then the solution is archived for future lessons. RLVR turns that workflow into a training loop over code and math benchmarks.

Math studio

A tutor checks each algebra step against an answer key. RLVR does the same with verifiable rewards for GSM8K-style math.

Code judge

Automated tests confirm every program revision. Reasoning models keep revising until the unit tests pass.
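Both analogies reduce to a reward you can compute automatically. The toy verifiers below are sketches only: a real math grader normalises answer formats, and a real code judge executes candidates in a sandbox rather than calling exec directly.

```python
# Two toy verifiers matching the analogies. Real graders normalise answer
# formats and run candidate programs in a sandbox; calling exec directly
# here is for illustration only.
def math_verifier(predicted_answer: str, gold_answer: str) -> bool:
    """Answer-key check: compare final answers, numerically when possible."""
    try:
        return float(predicted_answer.strip()) == float(gold_answer.strip())
    except ValueError:
        return predicted_answer.strip() == gold_answer.strip()

def code_verifier(program_src: str, tests: list) -> bool:
    """Unit-test check: each test is (function_name, args, expected_output)."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)
        return all(namespace[fn](*args) == expected for fn, args, expected in tests)
    except Exception:
        return False
```

For example, `code_verifier("def add(a, b): return a + b", [("add", (2, 3), 5)])` returns True, which is exactly the kind of binary signal RLVR feeds back into training.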

Visualization

Reasoning Practice Lab

Use the labs to step through RLVR-style revisions, inspect verifier gains per domain, and see how inference-time scaling affects accuracy, latency, and cost.

Reasoning chain lab

Step through chain-of-thought revisions reinforced with verifiable rewards (Chapter 14).


RLVR reward explorer

Estimate accuracy gains when you add more verifiable reward passes (Chapter 14).
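A back-of-envelope model of the curve this explorer shows, under the idealised assumption that each pass succeeds independently with per-pass solve rate p (real gains flatten faster):

```python
# If each pass solves the problem independently with probability p, the
# chance that at least one of k passes is verified is 1 - (1 - p)**k.
# Independence is an idealisation; real gains flatten faster.
def expected_accuracy(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

for k in (1, 2, 4, 8, 16):
    print(k, round(expected_accuracy(0.3, k), 3))
```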


Inference-time scaling trade-offs

Explore the accuracy, latency, and cost impact of longer chains and more self-consistency samples (Chapter 14).

Takeaways

Operational Notes

  • Start with verifiable benchmarks (math, code, logic) so rewards are deterministic.
  • Log every successful trajectory for downstream distillation and evaluation.
  • Plan for inference-time scaling: longer chains and self-consistency votes raise accuracy but increase latency (see the voting sketch after this list).
  • Mix RLVR with instruction tuning to spread reasoning patterns to smaller models.
  • Track verifier coverage; if too many prompts return zero reward, broaden the dataset or relax constraints.
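
As a concrete example of the inference-time scaling trade-off above, here is a minimal self-consistency sketch. `sample_chain` and `extract_answer` are placeholders for your own decoding and parsing code; latency and cost grow roughly linearly in the number of votes k.

```python
# Minimal sketch of inference-time scaling via self-consistency voting:
# sample k chains, extract each final answer, and return the majority.
# `sample_chain` and `extract_answer` are placeholders.
from collections import Counter

def self_consistency(prompt, sample_chain, extract_answer, k=8):
    answers = [extract_answer(sample_chain(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```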
Self-check

Reasoning Training Check

Review RLVR fundamentals, historical context, and scaling trade-offs from Chapter 14.


  1. What distinguishes RL with Verifiable Rewards (RLVR) from standard preference-based RLHF?
  2. Which early reasoning method approximated policy gradient while filtering traces with cross-entropy?
  3. Why did o1 and DeepSeek R1 highlight inference-time scaling?
  4. Which of the following is a verifiable reward signal cited for reasoning training?
  5. How do modern reasoning recipes use distillation according to Chapter 14?