Reasoning Training & Inference Scaling
Walk through RL with verifiable rewards, chain-of-thought refinement, and inference-time scaling techniques used by o1, DeepSeek R1, and Tulu 3.
- Estimated time: 40 minutes
- Difficulty: Advanced
- Prerequisites: 2 modules
RLVR Objective
Chapter 14 frames RL with Verifiable Rewards (RLVR) as an extension of the RLHF objective where reward signals come from automated checkers instead of human ratings. For a verifier v that returns 1 when an answer checks out and 0 otherwise, the optimisation step looks like PPO or policy gradient with the binary reward r(x, y) = v(x, y).
When the verifier returns structured feedback (unit tests, proof traces), teams often log extra metadata so they can reuse successful trajectories for distillation.
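To make the reward step concrete, here is a minimal Python sketch assuming a unit-test-style verifier; the names (run_unit_tests, Trajectory) and the metadata fields are illustrative, not taken from Chapter 14.

```python
# Minimal sketch of an RLVR reward step: a verifier returns a binary score and
# successful trajectories are archived with metadata for later distillation.
# The names (run_unit_tests, Trajectory) are illustrative, not from Chapter 14.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    prompt: str
    completion: str
    reward: float
    metadata: dict = field(default_factory=dict)


def run_unit_tests(prompt: str, completion: str) -> tuple[bool, dict]:
    """Stand-in verifier: run tests or a proof checker and report what passed."""
    passed = completion.strip().endswith("72")        # toy check for the lab prompt
    return passed, {"checks_run": 1, "checks_passed": int(passed)}


def rlvr_reward(prompt: str, completion: str, archive: list[Trajectory]) -> float:
    """Binary reward fed to PPO / policy gradient; successes are kept for reuse."""
    passed, details = run_unit_tests(prompt, completion)
    reward = 1.0 if passed else 0.0
    if passed:
        archive.append(Trajectory(prompt, completion, reward, details))
    return reward
```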
From STaR to RLVR
Earlier work like STaR and TRICE (2022-2023) approximated the policy gradient on math problems by sampling reasoning traces, filtering them for correct answers, and applying cross-entropy updates to the survivors. Chapter 14 explains how modern systems such as Tulu 3, VinePPO, and DeepSeek R1 build on these ideas with binary rewards from execution engines and proof checkers.
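A rough sketch of that filter-then-finetune loop; sample_traces, verify_answer, and sft_update are placeholders for whatever generation, checking, and training utilities a given stack provides.

```python
# Sketch of a STaR-style round: sample reasoning traces, keep the ones whose
# answers verify, then run a cross-entropy (SFT) update on the survivors.
# sample_traces, verify_answer, and sft_update are placeholders for your stack.
def star_round(model, prompts, sample_traces, verify_answer, sft_update, k=8):
    kept = []
    for prompt in prompts:
        for trace in sample_traces(model, prompt, k):   # k attempts per prompt
            if verify_answer(prompt, trace):            # binary filter, no reward model
                kept.append((prompt, trace))
    if kept:
        # Cross-entropy on the filtered traces approximates a policy-gradient
        # step with binary rewards, which is the link Chapter 14 draws to RLVR.
        sft_update(model, kept)
    return model, kept
```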
Reinforcement finetuning lets the model practice a narrow set of prompts hundreds of times, reinforcing correct answers instead of simply imitating reference text. The result is a model that plans, verifies, and revises before committing to an answer, much like the tutor in the analogy below. Distillation then compresses that behaviour into smaller students that are cheap to serve.
Measuring progress means tracking verifier pass rate, reward deltas, and inference tokens per answer. Chapter 14 links all three to the surge of reasoning models like o1, R1, and Tulu 3.
Analogy: Math Tutor Studio
Imagine a classroom where every attempt at a proof is checked instantly. Students revise until the grader signs off, and the accepted solution is archived for future lessons. RLVR turns that workflow into a training signal on code and math benchmarks.
Math studio
A tutor checks each algebra step against an answer key. RLVR does the same with verifiable rewards for GSM8K-style math.
Code judge
Automated tests confirm every program revision. Reasoning models bounce ideas until the unit tests pass.
Reasoning Practice Lab
Use the labs to step through RLVR-style revisions, inspect verifier gains per domain, and see how inference-time scaling affects accuracy, latency, and cost.
Reasoning chain lab
Step through chain-of-thought revisions reinforced with verifiable rewards (Chapter 14).
Parameters
Prompt
What is 3/4 of 96?
Baseline attempt
Model guesses 68 without showing work and fails the verifier.
- FAIL
Step 1
Convert the fraction: one quarter of 96 is 24.
Verifier: Unit fraction check passes (96 / 4 = 24).
- PASS
Step 2
Three quarters means 3 * 24 = 72.
Verifier: Arithmetic check confirms 72.
- PASS
Step 3
Answer: 72, with a short explanation of the steps.
Verifier: Math reward model marks the answer as correct.
- PASS
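The chain above can be replayed with plain arithmetic; this toy snippet mirrors the lab's step-level checks and is only an illustration, not the lab's actual verifier.

```python
# Toy replay of this lab's chain for "What is 3/4 of 96?". It mirrors the
# step-level checks above with plain arithmetic; no external verifier needed.
def check_chain():
    quarter = 96 / 4                 # Step 1: unit fraction check (should be 24)
    assert quarter == 24
    answer = 3 * quarter             # Step 2: three quarters = 3 * 24
    assert answer == 72
    baseline_guess = 68              # baseline attempt from the lab
    return {
        "baseline_pass": baseline_guess == answer,   # False -> reward 0
        "final_pass": answer == 72,                  # True  -> reward 1
        "answer": int(answer),
    }


print(check_chain())   # {'baseline_pass': False, 'final_pass': True, 'answer': 72}
```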
RLVR reward explorer
Estimate accuracy gains when you add more verifiable reward passes (Chapter 14).
Parameters
Accuracy
76%
Baseline 52% + 24-point RLVR gain = 76%.
Verifier budget
0.15 relative units
Each pass consumes compute for proof checkers, unit tests, or execution sandboxes.
Sample reuse
54% of traces retained
Chapter 14 recommends reusing successful traces to distill into instruction-tuned students.
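The explorer's read-outs can be approximated with a simple diminishing-returns curve; the curve and its constants are assumptions tuned so the defaults land near the numbers above, not formulas from Chapter 14.

```python
# Toy model of the reward explorer. The diminishing-returns curve (each pass
# converts a fixed share of the remaining failures) is an assumption tuned so
# the defaults land near the panel above; it is not a formula from Chapter 14.
def reward_explorer(baseline_acc=0.52, passes=3, fix_rate=0.21,
                    cost_per_pass=0.05, reuse_rate=0.54):
    acc = baseline_acc
    for _ in range(passes):
        acc += (1.0 - acc) * fix_rate                # share of remaining errors fixed
    return {
        "accuracy": round(acc, 2),                            # ~0.76 with these defaults
        "rlvr_gain": round(acc - baseline_acc, 2),            # ~0.24
        "verifier_budget": round(passes * cost_per_pass, 2),  # 0.15 relative units
        "traces_retained": reuse_rate,                        # share archived for distillation
    }


print(reward_explorer())
```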
Inference-time scaling trade-offs
Explore the accuracy, latency, and cost impact of longer chains and more self-consistency samples (Chapter 14).
Parameters
Pass rate estimate
78%
Chapter 14 observes accuracy rising with more tokens and self-consistency votes.
Latency
24576 ms (approx)
Latency grows linearly with token budget and sample count; batch scheduling can hide some cost.
Relative cost units
4.1
Use smaller students or distillation when cost exceeds your serving budget.
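A back-of-envelope sketch of the same trade-offs; the per-token latency, per-token cost, and independent-vote assumptions are illustrative, chosen so the defaults land near the panel's figures.

```python
# Back-of-envelope sketch of the trade-off panel: latency and cost scale
# linearly with token budget and sample count, while a self-consistency vote
# saturates. Per-token latency, cost, and single-sample accuracy are assumptions.
from math import comb


def majority_vote_acc(p_single: float, votes: int) -> float:
    """Chance a majority of independent samples is correct (odd vote counts)."""
    need = votes // 2 + 1
    return sum(comb(votes, k) * p_single**k * (1 - p_single)**(votes - k)
               for k in range(need, votes + 1))


def scaling_tradeoff(tokens_per_answer=2048, samples=4,
                     ms_per_token=3.0, cost_per_1k_tokens=0.5):
    total_tokens = tokens_per_answer * samples
    return {
        "latency_ms": total_tokens * ms_per_token,                          # 24576 ms
        "relative_cost": round(total_tokens / 1000 * cost_per_1k_tokens, 1),  # 4.1
        "pass_rate_5_votes": round(majority_vote_acc(0.66, 5), 2),          # ~0.78
    }


print(scaling_tradeoff())
```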
Operational Notes
- Start with verifiable benchmarks (math, code, logic) so rewards are deterministic.
- Log every successful trajectory for downstream distillation and evaluation (see the sketch after this list).
- Plan for inference-time scaling: longer chains and self-consistency votes raise accuracy but increase latency.
- Mix RLVR with instruction tuning to spread reasoning patterns to smaller models.
- Track verifier coverage; if too many prompts return zero reward, broaden the dataset or relax constraints.
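A minimal sketch of the logging and coverage notes above, assuming a JSONL archive; the helper names and record fields are hypothetical.

```python
# Illustrative trajectory archive + verifier-coverage tracker for the notes above.
# The JSONL layout and field names are hypothetical, not prescribed by Chapter 14.
import json


def log_trajectory(path: str, prompt: str, completion: str,
                   reward: float, domain: str) -> None:
    """Append a successful trajectory so it can feed distillation and evals later."""
    record = {"prompt": prompt, "completion": completion,
              "reward": reward, "domain": domain}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def verifier_coverage(rewards_by_domain: dict[str, list[float]]) -> dict[str, float]:
    """Share of prompts per domain that earned any reward; low values suggest
    broadening the dataset or relaxing verifier constraints."""
    return {domain: sum(1 for r in rewards if r > 0) / len(rewards)
            for domain, rewards in rewards_by_domain.items() if rewards}
```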
Reasoning Training Check
Review RLVR fundamentals, historical context, and scaling trade-offs from Chapter 14.
1. What distinguishes RL with Verifiable Rewards (RLVR) from standard preference-based RLHF?
2. Which early reasoning method approximated policy gradient while filtering traces with cross-entropy?
3. Why did o1 and DeepSeek R1 highlight inference-time scaling?
4. Which of the following is a verifiable reward signal cited for reasoning training?
5. How do modern reasoning recipes use distillation according to Chapter 14?