From the original project spec:

> How does on-policy distillation compare to off-policy distillation for training student models to match teacher models in few-shot learning contexts?

Answer: pure on-policy is strictly better; any off-policy exposure causes catastrophic collapse.
We tested two setups:
| Setup | Result |
|---|---|
| Context distillation (same model ± context) | 0% downstream — only teaches format matching |
| Size distillation (small ← large model) | Hybrid collapses; pure on-policy works |
Winner: teacher_seeded — on-policy GKD with decaying teacher prefix tokens (58-71% GSM8K accuracy).
See EXPERIMENT_FINDINGS.md for the full analysis.
| Method | Qwen (4B←30B) | Llama (8B←70B) | Off-Policy? |
|---|---|---|---|
| teacher_seeded | 58.6% | 71.0% | No |
| on_policy_gkd | 53.2% | 67.4% | No |
| extended_on_policy | 51.0% | 64.4% | No |
| replay_buffer | 2.8% | 7.4% | Partial |
| hybrid (off→on) | 0.0% | 0.0% | Yes |
| mixture (blend) | 0.0% | 0.0% | Yes |
| kl_anchored | 0.0% | 0.0% | Yes |
| reverse_curriculum | 0.0% | 0.0% | Yes |
Pattern: Every method with any off-policy component collapses. Pure on-policy works.
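The operational difference is where the training tokens come from: off-policy methods train the student on teacher-generated sequences, while on-policy GKD trains on the student's own samples scored by the teacher. Below is a minimal sketch of the two losses, assuming HuggingFace-style `student`/`teacher` causal LMs; the function names are illustrative, not the repo's tinker_trainer.py API.

```python
import torch
import torch.nn.functional as F

def off_policy_kd_loss(student, teacher_ids):
    """Supervised KD: teacher-forced next-token loss on tokens the *teacher* generated.
    (Prompt masking omitted for brevity.)"""
    logits = student(teacher_ids[:, :-1]).logits
    targets = teacher_ids[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def on_policy_gkd_loss(student, teacher, prompt_ids, max_new_tokens=128):
    """GKD-style loss: sample from the *student*, score those samples with the teacher."""
    with torch.no_grad():
        sample_ids = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                      do_sample=True)
        teacher_logp = F.log_softmax(teacher(sample_ids[:, :-1]).logits, dim=-1)
    student_logp = F.log_softmax(student(sample_ids[:, :-1]).logits, dim=-1)
    # Reverse KL(student || teacher), evaluated only on student-generated sequences.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()
```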
When transitioning from off-policy to on-policy training:
```
Step 0-49:   Off-policy (supervised on teacher outputs)
Step 50:     PHASE TRANSITION
             → Student generates with modified weights
             → Produces "teacher-like garbage" (similar tokens, broken reasoning)
             → Teacher assigns low logprobs
             → GKD pushes student toward degenerate solutions
Step 50-100: COLLAPSE (scores drop to 0%, unrecoverable)
```
Root cause: Off-policy teaches token mimicry without reasoning. When forced to generate independently, the student produces syntactically similar but semantically broken output.
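For concreteness, the hybrid (off→on) schedule amounts to a hard switch between the two losses sketched above. This is an illustrative sketch using the assumed step counts from the timeline, not the repo's implementation:

```python
def hybrid_loss(step, student, teacher, prompt_ids, teacher_ids, cutover=50):
    """Hybrid off→on schedule: supervised KD until the cutover, then pure GKD.
    The abrupt change at step 50 is the phase transition described above.
    Reuses off_policy_kd_loss / on_policy_gkd_loss from the earlier sketch."""
    if step < cutover:  # Steps 0-49: off-policy, trained on teacher outputs
        return off_policy_kd_loss(student, teacher_ids)
    return on_policy_gkd_loss(student, teacher, prompt_ids)  # Step 50+: student generates
```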
teacher_seeded provides off-policy benefits without off-policy's failure mode:
```
Step 0:  [Teacher: 20 tokens] [Student: rest]
Step 25: [Teacher: 10 tokens] [Student: rest]
Step 50: [Teacher: 0 tokens]  [Student: all]
```
- Teacher prefix keeps student generations coherent
- Student learns from meaningful teacher feedback
- Gradual handoff prevents the hard phase transition that causes collapse
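A minimal sketch of the decaying-prefix rollout, reusing the illustrative models from the sketches above. The linear 20→0-token schedule matches the steps listed here, but the exact decay and rollout logic in the repo live in tinker_trainer.py:

```python
import torch

def teacher_prefix_len(step, start_len=20, decay_steps=50):
    """Linear decay of the teacher-written prefix: 20 tokens at step 0, 0 by step 50."""
    return max(0, round(start_len * (1 - step / decay_steps)))

def teacher_seeded_rollout(step, student, teacher, prompt_ids, max_new_tokens=128):
    """Teacher writes the first k tokens; the student completes the rest on-policy.
    The GKD loss is then computed on this rollout, as in on_policy_gkd_loss above."""
    k = teacher_prefix_len(step)
    seeded = prompt_ids
    if k > 0:
        with torch.no_grad():
            seeded = teacher.generate(prompt_ids, max_new_tokens=k, do_sample=True)
    return student.generate(seeded, max_new_tokens=max_new_tokens - k, do_sample=True)
```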
```bash
# Run with Qwen (recommended — best model family)
python run_cliff_mitigation_experiment.py --model-family qwen --n-seeds 10

# Run with Llama
python run_cliff_mitigation_experiment.py --model-family llama --n-seeds 10

# Quick test (3 seeds)
python run_cliff_mitigation_experiment.py --quick-test

# Run specific methods
python run_cliff_mitigation_experiment.py --methods teacher_seeded on_policy_gkd
```
```
context-distillation/
├── README.md                            # This file
├── EXPERIMENT_FINDINGS.md               # Full analysis with project history
├── run_cliff_mitigation_experiment.py   # Cliff mitigation experiment (8 methods)
├── run_tinker_experiment.py             # Original 3-mode experiment
├── tinker_trainer.py                    # Tinker SDK trainer with all methods
├── config.py                            # Configuration classes
├── context_generator.py                 # Few-shot context generation
├── requirements.txt                     # Python dependencies
└── .env.example                         # Environment variable template
```
| Family | Student | Teacher | Param gap |
|---|---|---|---|
| qwen | Qwen3-4B-Instruct | Qwen3-30B-A3B-Instruct | 7.5x |
| llama | Llama-3.1-8B-Instruct | Llama-3.3-70B-Instruct | 8.75x |
Deprecated (0% downstream accuracy):
- `same-model-qwen` / `same-model-llama` — context distillation only teaches format matching
- Context distillation doesn't work for capability transfer — use size distillation instead
- Never mix on-policy and off-policy — any off-policy exposure causes collapse
- Teacher seeding is the best approach — soft transition without off-policy corruption
- The GKD paper's hybrid recommendation doesn't generalize — capability gaps change everything
- On-Policy Distillation of Language Models (GKD) - Agarwal et al. 2024
- Learning by Distilling Context - Snell et al. 2022
- Original Project Spec