Advanced Concepts

Reasoning Training & Inference Scaling

Walk through RL with verifiable rewards, chain-of-thought refinement, and inference-time scaling techniques used by o1, DeepSeek R1, and Tulu 3.

Estimated time
40 minutes
Difficulty
advanced
Prerequisites
2 modules
Equation

RLVR Objective

Chapter 14 frames RL with Verifiable Rewards (RLVR) as an extension of the RLHF objective where reward signals come from automated checkers instead of human ratings. For a verifier V(x, y) ∈ {0, 1}, the optimisation step looks like PPO or policy gradient with binary rewards.

\begin{aligned}
&\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\big[\, V(x, y) \cdot A_{\theta}(x, y) \,\big] \\
&\text{where } A_{\theta}(x, y) = V(x, y) - b(x) \text{ for a prompt-level baseline } b(x), \text{ or comes from a critic trained on verifier outcomes.}
\end{aligned}

When the verifier returns structured feedback (unit tests, proof traces), teams often log extra metadata so they can reuse successful trajectories for distillation.
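To make the update concrete, here is a minimal Python sketch of one RLVR step under the simplest choice of baseline: score a group of sampled answers with the 0/1 verifier, take the group-mean reward as b(x), and weight each answer's log-probability by the resulting advantage. The names `verifier`, `answers`, and `logprob_sums` are illustrative placeholders, not APIs from any particular framework.

```python
# Minimal sketch of one RLVR update step: score K sampled answers with a
# 0/1 verifier, use the group-mean reward as the baseline b(x), and weight
# each answer's log-probability by the resulting advantage.
# `verifier`, `answers`, and `logprob_sums` are illustrative placeholders.
from statistics import mean

def rlvr_advantages(prompt, answers, verifier):
    """A(x, y) = V(x, y) - b(x), with b(x) = mean verifier reward over the group."""
    rewards = [float(verifier(prompt, a)) for a in answers]  # V(x, y) in {0, 1}
    baseline = mean(rewards)                                 # b(x)
    return [r - baseline for r in rewards]

def policy_gradient_loss(logprob_sums, advantages):
    """REINFORCE-style surrogate: -(1/K) * sum_k A_k * log pi(y_k | x).
    With autograd tensors for logprob_sums, minimising this pushes
    probability mass toward verified answers."""
    k = len(advantages)
    return -sum(a * lp for a, lp in zip(advantages, logprob_sums)) / k
```

Using the group mean as the baseline is only one option; as the equation above notes, a learned critic trained on verifier outcomes can supply the advantage instead.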

Intuition

From STaR to RLVR

Earlier work such as STaR and TRICE (2022-2023) approximated the policy gradient on math problems by filtering sampled traces and applying cross-entropy updates to the ones that passed. Chapter 14 explains how modern systems such as Tulu 3, VinePPO, and DeepSeek R1 build on these ideas with binary rewards from execution engines and proof checkers.
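A rough sketch of that filtering loop, assuming you supply your own `generate`, `is_correct`, and `finetune` callables (all hypothetical names): sample a few rationales per problem, keep only the ones the checker accepts, and train on the kept traces with ordinary cross-entropy.

```python
# STaR-style iteration sketch: sample rationales, keep only the ones the
# checker accepts, and fine-tune on them with ordinary cross-entropy.
# `generate`, `is_correct`, and `finetune` stand in for your own sampling,
# verification, and SFT code.
def star_round(problems, generate, is_correct, finetune, samples_per_problem=4):
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = generate(problem)        # rationale + final answer
            if is_correct(problem, trace):   # binary filter, no reward model
                kept.append((problem, trace))
                break                        # one verified trace is enough
    finetune(kept)                           # cross-entropy on filtered traces
    return kept
```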

Reinforcement finetuning lets the model practise a narrow set of prompts hundreds of times, reinforcing correct answers instead of simply imitating reference text. The result is a tutor-like model that plans, verifies, and revises before committing to an answer. Distillation then compresses that behaviour into smaller student models that are cheaper to serve.

Measuring progress means tracking verifier pass rate, reward deltas, and inference tokens per answer. Chapter 14 links all three to the surge of reasoning models like o1, R1, and Tulu 3.
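As a sketch of what that tracking might look like, assuming each logged record carries a verified flag and a token count (the field names here are illustrative, not from any specific logging schema):

```python
# Sketch of the three tracking metrics: verifier pass rate, reward delta
# versus a reference run, and inference tokens per answer.
# Field names ('verified', 'num_tokens') are illustrative.
def reasoning_metrics(records, baseline_pass_rate):
    """records: iterable of dicts with 'verified' (bool) and 'num_tokens' (int)."""
    records = list(records)
    pass_rate = sum(r["verified"] for r in records) / len(records)
    return {
        "verifier_pass_rate": pass_rate,
        "reward_delta": pass_rate - baseline_pass_rate,  # gain over reference run
        "tokens_per_answer": sum(r["num_tokens"] for r in records) / len(records),
    }
```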

Analogy

Analogy: Math Tutor Studio

Imagine a classroom where every attempt at a proof is checked instantly. Students revise until the grader signs off, then the solution is archived for future lessons. RLVR turns that workflow into a training loop over code and math benchmarks.

Math studio

A tutor checks each algebra step against an answer key. RLVR does the same with verifiable rewards for GSM8K-style math.

Code judge

Automated tests confirm every program revision. Reasoning models keep revising until the unit tests pass.
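Both analogies reduce to a reward you can compute automatically. The toy verifiers below are sketches only: a real math grader normalises answer formats, and a real code judge executes candidates in a sandbox rather than calling exec directly.

```python
# Two toy verifiers matching the analogies. Real graders normalise answer
# formats and run candidate programs in a sandbox; calling exec directly
# here is for illustration only.
def math_verifier(predicted_answer: str, gold_answer: str) -> bool:
    """Answer-key check: compare final answers, numerically when possible."""
    try:
        return float(predicted_answer.strip()) == float(gold_answer.strip())
    except ValueError:
        return predicted_answer.strip() == gold_answer.strip()

def code_verifier(program_src: str, tests: list) -> bool:
    """Unit-test check: each test is (function_name, args, expected_output)."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)
        return all(namespace[fn](*args) == expected for fn, args, expected in tests)
    except Exception:
        return False
```

For example, `code_verifier("def add(a, b): return a + b", [("add", (2, 3), 5)])` returns True, which is exactly the kind of binary signal RLVR feeds back into training.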

Visualization

Reasoning Practice Lab

Use the labs to step through RLVR-style revisions, inspect verifier gains per domain, and see how inference-time scaling affects accuracy, latency, and cost.

Reasoning chain lab

Step through chain-of-thought revisions reinforced with verifiable rewards (Chapter 14).


RLVR reward explorer

Estimate accuracy gains when you add more verifiable reward passes (Chapter 14).
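A back-of-envelope model of the curve this explorer shows, under the idealised assumption that each pass succeeds independently with per-pass solve rate p (real gains flatten faster):

```python
# If each pass solves the problem independently with probability p, the
# chance that at least one of k passes is verified is 1 - (1 - p)**k.
# Independence is an idealisation; real gains flatten faster.
def expected_accuracy(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

for k in (1, 2, 4, 8, 16):
    print(k, round(expected_accuracy(0.3, k), 3))
```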


Inference-time scaling trade-offs

Explore the accuracy, latency, and cost impact of longer chains and more self-consistency samples (Chapter 14).

Takeaways

Operational Notes

  • Start with verifiable benchmarks (math, code, logic) so rewards are deterministic.
  • Log every successful trajectory for downstream distillation and evaluation.
  • Plan for inference-time scaling: longer chains and self-consistency votes raise accuracy but increase latency (see the voting sketch after this list).
  • Mix RLVR with instruction tuning to spread reasoning patterns to smaller models.
  • Track verifier coverage; if too many prompts return zero reward, broaden the dataset or relax constraints.
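
As a concrete example of the inference-time scaling trade-off above, here is a minimal self-consistency sketch. `sample_chain` and `extract_answer` are placeholders for your own decoding and parsing code; latency and cost grow roughly linearly in the number of votes k.

```python
# Minimal sketch of inference-time scaling via self-consistency voting:
# sample k chains, extract each final answer, and return the majority.
# `sample_chain` and `extract_answer` are placeholders.
from collections import Counter

def self_consistency(prompt, sample_chain, extract_answer, k=8):
    answers = [extract_answer(sample_chain(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```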
Self-check

Reasoning Training Check

Review RLVR fundamentals, historical context, and scaling trade-offs from Chapter 14.


  1. What distinguishes RL with Verifiable Rewards (RLVR) from standard preference-based RLHF?
  2. Which early reasoning method approximated policy gradient while filtering traces with cross-entropy?
  3. Why did o1 and DeepSeek R1 highlight inference-time scaling?
  4. Which of the following is a verifiable reward signal cited for reasoning training?
  5. How do modern reasoning recipes use distillation according to Chapter 14?