Instruction Tuning • RLHF Module

Equation

Supervised Objective

Instruction tuning reuses the autoregressive cross-entropy loss, but Chapter 9 emphasises masking so that only assistant tokens contribute. Given prompts $x$ and target responses $y$, the loss is:

$$
\mathcal{L}_{\text{IFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ \log \pi_\theta(y \mid x) \big]
$$

Prompt tokens are masked (loss weight zero) so the model imitates only the assistant responses. Multi-turn chats are unrolled so each assistant turn is trained separately while previous turns remain in the context (Chapter 9).
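
Written per token, the masked objective can be sketched as follows. The notation is an assumption introduced here for illustration, not taken from Chapter 9: $s = (s_1, \dots, s_T)$ is the serialised conversation (prompt followed by response), and $m_t \in \{0, 1\}$ equals 1 exactly when $s_t$ is an assistant token of the target response.

$$
\mathcal{L}_{\text{IFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}}\Big[ \sum_{t=1}^{T} m_t \, \log \pi_\theta(s_t \mid s_{<t}) \Big]
$$

Because $m_t = 0$ on every prompt token, the sum reduces to $\log \pi_\theta(y \mid x)$ factorised over the response tokens, which matches the sequence-level loss. Many implementations additionally divide by $\sum_t m_t$ to average over target tokens; that normalisation is a common convention rather than something the chapter specifies.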

Intuition

Why Instruction Tuning Matters

Instruction tuning converts a base language model into a conversational agent. Chapter 9 positions it as the first stage of post-training: establish the question–answer format, teach the model to follow instructions, and curate a diverse mixture of tasks (FLAN, Natural Instructions, Tulu-style blends).

Good instruction tuning raises the floor for downstream RLHF work—reward models, PPO, and DPO all assume consistent templates. The chapter also highlights practical considerations: smaller batch sizes than pretraining, prompt/turn masking, and maintaining metadata for auditing synthetic augmentations.

Recipe reminders:

Template discipline. Adopt a single chat template per model family.
Diverse tasks. Blend human-written tasks with filtered synthetic expansions.
Masking. Apply loss only to assistant tokens and handle multi-turn chats carefully.

Analogy

Analogy: Writing Coach & Script Editor

The writing coach supplies exemplars; the script editor enforces the dialogue format. Instruction tuning similarly provides demonstrations while controlling how they are rendered to the model’s tokenizer.

Writing coach

Demonstrates the format and tone required, providing exemplar answers that the model imitates during supervised fine-tuning.

Script editor

Defines the chat template and masks stage directions so actors focus on their lines—mirroring instruction tuning’s selective loss masking.

Visualization

Template & Masking Lab

Experiment with the chat template builder and masking visualiser to internalise Chapter 9’s workflow.

Chat template builder

Adjust the system prompt and message turns to see how datasets are serialised before instruction tuning, as outlined in Chapter 9.

Template

Use <|im_start|>role ... markers, as in Chapter 9.

System message
User prompt (turn 1)
Assistant response (turn 1)
Optional user follow-up (turn 2)

Serialised conversation

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarise the key points of RLHF.<|im_end|>
<|im_start|>assistant
RLHF combines instruction tuning, preference data, and policy optimisation to align models.<|im_end|>
<|im_start|>user
Can you add a safety caveat?<|im_end|>
<|im_start|>assistant

Why it matters

Chapter 9 emphasises consistent templates so token masking and preference data align. Use this builder to double-check BOS/EOS markers and alternating roles before writing dataset converters.
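
A minimal sketch of what such a builder might do, assuming the <|im_start|>role ... <|im_end|> layout shown in the serialised conversation. The function names `render_chat` and `check_alternation` and the `add_generation_prompt` flag are hypothetical, chosen for illustration; a real pipeline would normally rely on the tokenizer's own chat template.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
ALLOWED_ROLES = {"system", "user", "assistant"}


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Serialise messages into <|im_start|>role ... <|im_end|> blocks."""
    parts = []
    for msg in messages:
        if msg["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {msg['role']!r}")
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def check_alternation(messages: List[Dict[str, str]]) -> bool:
    """True if, after an optional leading system message, roles alternate user/assistant."""
    roles = [m["role"] for m in messages]
    if roles and roles[0] == "system":
        roles = roles[1:]
    expected = ["user", "assistant"]
    return all(role == expected[i % 2] for i, role in enumerate(roles))


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise the key points of RLHF."},
]
serialised = render_chat(conversation, add_generation_prompt=True)
```

Running `check_alternation` before `render_chat` in a dataset converter is one way to catch conversations with two consecutive user or assistant turns before they reach training.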

Prompt masking visualiser

See which tokens contribute to the loss during instruction tuning, following the masking guidance from Chapter 9.

Parameters

Include second user turn
Mask prompt tokens (SFT convention)

Segment [role]

<|im_start|>system [system]
You are a helpful assistant. [system]
<|im_end|> [system]
<|im_start|>user [user]
Explain RLHF in two sentences. [user]
<|im_end|> [user]
<|im_start|>assistant [assistant]
RLHF combines instruction tuning, preference data, and policy optimisation. [assistant]
It aligns outputs with human intent via reward models or direct preference optimisation. [assistant]
<|im_end|> [assistant]
<|im_start|>user [user]
Add a short safety disclaimer. [user]
<|im_end|> [user]
<|im_start|>assistant [assistant]
Note: Verify outputs with internal policy before acting. [assistant]
<|im_end|> [assistant]

Why mask prompts?

Chapter 9 explains that instruction tuning only applies loss to assistant tokens so the model learns responses, not the user’s words. Multi-turn sequences are “unrolled” so each target assistant turn remains unmasked while earlier context stays in the prompt.
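
The unrolling step could look roughly like the sketch below. It works on text segments rather than token ids to stay self-contained, and the helper name `unroll` is hypothetical. One assumption is labelled in the code: the `<|im_start|>assistant` header of the target turn is treated as prompt (masked), and the response text plus its closing `<|im_end|>` is the target. The source does not say how the header should be handled, so adjust this to your own convention.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, bool]  # (text, contributes_to_loss)


def unroll(messages: List[Dict[str, str]]) -> List[List[Segment]]:
    """Create one training sample per assistant turn.

    Each sample contains all earlier turns as masked context and marks
    only the final assistant response as a loss target.
    """
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        segments: List[Segment] = []
        for m in messages[:i]:
            # Earlier turns, including earlier assistant turns, are context only.
            segments.append((f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n", False))
        # Assumption: the role header of the target turn is masked as prompt.
        segments.append(("<|im_start|>assistant\n", False))
        segments.append((f"{msg['content']}<|im_end|>\n", True))
        samples.append(segments)
    return samples


chat = [
    {"role": "user", "content": "Explain RLHF in two sentences."},
    {"role": "assistant", "content": "RLHF combines instruction tuning, preference data, and policy optimisation."},
    {"role": "user", "content": "Add a short safety disclaimer."},
    {"role": "assistant", "content": "Note: Verify outputs with internal policy before acting."},
]
samples = unroll(chat)  # two samples, one per assistant turn
```

In a token-level pipeline the boolean flag would be expanded to one mask entry per token after tokenising each segment.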

Takeaways

Implementation Notes

Instruction tuning keeps the standard cross-entropy loss but masks prompt tokens so only assistant outputs are learned.
Chat templates (system/user/assistant markers) must be consistent across instruction, preference, and RL datasets.
Multi-turn dialogues are unrolled into multiple samples so each assistant turn is a target.
Datasets combine curated human prompts with synthetic expansions; track provenance and apply safety filters.
Instruction tuning runs with smaller batch sizes than pretraining and often precedes RLHF stages such as rejection sampling or PPO.
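
To make the first note concrete, here is a small sketch of a masked cross-entropy computed from per-token log-probabilities. The function name `masked_nll` and the toy numbers are illustrative assumptions. Averaging over the unmasked tokens is also an assumption (a common convention), since the chapter's formula is stated at the sequence level.

```python
import math
from typing import Sequence


def masked_nll(token_log_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    n_targets = sum(loss_mask)
    if n_targets == 0:
        raise ValueError("no assistant tokens to train on")
    # Prompt positions (mask 0) are skipped entirely.
    total = -sum(lp for lp, m in zip(token_log_probs, loss_mask) if m)
    return total / n_targets


# Toy example: two prompt tokens (masked) followed by two assistant tokens.
log_probs = [math.log(0.5), math.log(0.1), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1]
loss = masked_nll(log_probs, mask)
```

Changing the log-probabilities at the two masked positions leaves `loss` unchanged, which is the practical meaning of "loss weight zero" for prompt tokens.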

Self-check

Instruction Tuning Check

Verify your understanding of chat templates, masking, and dataset practices from Chapter 9.

1. What loss does Chapter 9 use for instruction tuning?
2. Why are chat templates critical according to Chapter 9?
3. What does prompt masking achieve during instruction tuning?
4. How are multi-turn conversations prepared for instruction tuning?
5. Which dataset practice does Chapter 9 highlight?