
[vllm,trainer,algo] feat: Enable On-Policy Distillation for VLM#5592

Draft
JacobHelwig wants to merge 180 commits into verl-project:main from JacobHelwig:jhelwig/opdServer

Conversation

@JacobHelwig (Collaborator)

What does this PR do?

Adds support for on-policy distillation (OPD) with a VLM student and teacher.

Test

Tested with examples/on_policy_distillation_trainer/run_qwen3_vl_geo3k.sh.

  • Data: Geometry3K
  • Student: Qwen3-VL-2B-Instruct
  • Teacher: Qwen3-VL-4B-Instruct
  • OPD algo: k1 KL estimator as reward with policy gradient loss
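The k1-estimator-as-reward setup in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code; the tensor names, shapes, and sign convention are assumptions. Response tokens are sampled from the student, so the per-token log-ratio is a single-sample (k1) estimate of the student-to-teacher KL, and its negation serves as a per-token reward for the policy-gradient loss.

```python
import torch

def k1_distill_reward(student_logprobs: torch.Tensor,
                      teacher_logprobs: torch.Tensor,
                      response_mask: torch.Tensor) -> torch.Tensor:
    """Per-token distillation reward from the k1 (single-sample) KL estimator.

    Since tokens are sampled from the student, log p_student - log p_teacher is
    an unbiased per-token estimate of KL(student || teacher). Using its negation
    as a reward drives the policy-gradient update to shrink that KL.
    All tensors: (batch, response_len); mask is 1 on real response tokens.
    """
    k1 = student_logprobs - teacher_logprobs
    return -k1 * response_mask

# Toy example: where the student is confident but the teacher disagrees,
# the reward is negative; where they agree, the reward is ~0.
student = torch.tensor([[-0.1, -0.2]])
teacher = torch.tensor([[-2.0, -0.2]])
mask = torch.ones_like(student)
reward = k1_distill_reward(student, teacher, mask)  # [[-1.9, 0.0]]
```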

Geo3K eval acc: (plot)

Geo3K train acc: (plot)

Distillation loss: (plot)

Design & Code Changes

This PR is stacked on #4897. Here's the diff between the two branches: JacobHelwig/verl@jhelwig/onPolicyDistillation...JacobHelwig:verl:jhelwig/opdServer.

#4897 submits requests to the vLLMHttpServer via the v1/completions endpoint, which does not support multi-modal data. v1/chat/completions does support multi-modal inputs, but it requires text to be passed as raw text rather than token IDs. This prevents exact scoring of student generations: going from student generation IDs to student generation text and then to teacher input IDs via v1/chat/completions tokenization will not always yield teacher input IDs equal to the original student generation IDs (https://vllm.ai/blog/agent-lightning).
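The root issue is that detokenize-then-retokenize is not the identity on token IDs. A self-contained toy illustration (the vocabulary and greedy tokenizer here are invented for demonstration; real BPE tokenizers exhibit the same effect with merged tokens):

```python
# Toy illustration of why decode -> re-encode need not be the identity on token IDs.
# A greedy longest-match tokenizer with a merged token "ab" re-encodes the text
# "ab" as the single "ab" token, even if the student happened to emit "a", "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
INV = {v: k for k, v in VOCAB.items()}

def decode(ids):
    return "".join(INV[i] for i in ids)

def encode(text):
    # Greedy longest-match tokenization.
    ids, i = [], 0
    while i < len(text):
        for length in (2, 1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
    return ids

student_ids = [0, 1]                       # student sampled "a", "b" as two tokens
roundtrip = encode(decode(student_ids))    # [2]: "ab" as one token != student_ids
```

Because `roundtrip != student_ids`, teacher logprobs computed on the re-tokenized text would not align token-for-token with the student's sampled tokens.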

This PR instead follows a path similar to how rollout replicas directly call the generate method on the vLLMHttpServer. This enables multi-modal inputs while representing text as token IDs. Requests to the teacher server now call the newly-added compute_logprobs method of vLLMHttpServer.
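Conceptually, a teacher-side logprob endpoint has to gather, from the teacher's logits over the concatenated prompt+response token IDs, the log-probability of each response token under a one-token left shift (the logits at position t predict the token at position t+1). A hedged sketch of that core operation; the function name and shapes are illustrative, not the actual compute_logprobs signature:

```python
import torch
import torch.nn.functional as F

def response_logprobs(logits: torch.Tensor,
                      input_ids: torch.Tensor,
                      response_len: int) -> torch.Tensor:
    """Log-probs of the last `response_len` tokens of `input_ids` under `logits`.

    logits: (seq_len, vocab) teacher logits for the full prompt+response.
    The logits at position t predict the token at position t+1, hence the
    one-token left shift when scoring response tokens.
    """
    logprobs = F.log_softmax(logits, dim=-1)
    # Positions whose *next* token is a response token.
    pred_positions = logprobs[-response_len - 1 : -1]   # (response_len, vocab)
    targets = input_ids[-response_len:]                 # (response_len,)
    return pred_positions.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Toy check: a "model" that puts nearly all mass on token 3 assigns
# near-zero logprob (probability ~1) to response tokens that are 3.
logits = torch.full((4, 5), -10.0)
logits[:, 3] = 10.0
ids = torch.tensor([1, 3, 3, 3])
lp = response_logprobs(logits, ids, response_len=2)
```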

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request introduces a significant feature: on-policy distillation (OPD) for vLLM, with a focus on Vision Language Models. The core of the change is a new compute_logprobs method on the vLLMHttpServer that allows for scoring student-generated tokens with a teacher model, even for multi-modal inputs. The implementation includes a new distillation module with configurable loss functions, a TeacherModelManager for handling teacher model replicas, and updates across the training pipeline to integrate this new capability. The changes are extensive and well-thought-out. My review includes a couple of suggestions to improve the robustness and usability of the new functionality.

Comment on lines +119 to +127

```diff
 assert not prompt_lens.eq(0).any(), f"seq_offset - resp_len - 1 assumes prompt_len > 0. Got {prompt_lens}"

 response_list = []
 # Skip padding dimensions after sequence dimensions, if any.
 skip_padding = (0, 0) * (values.ndim - 1)
 for resp_len, seq_offset in zip(response_lens, sequence_offsets, strict=True):
     pad_size = max_response_len - resp_len
     # left-shift model output by one token for log_probs/values
-    response_list.append(F.pad(values[seq_offset - resp_len - 1 : seq_offset - 1], (0, pad_size)))
+    response_list.append(F.pad(values[seq_offset - resp_len - 1 : seq_offset - 1], (*skip_padding, 0, pad_size)))
```


Severity: high

This assertion prevents the function from handling empty prompts, which might be a valid use case. The indexing seq_offset - resp_len - 1 on line 127 is incorrect when prompt_len is 0, as there is no token preceding the response to compute the first log-probability.

To make this function more robust, consider removing the assertion and handling the empty prompt case explicitly. When the prompt is empty, you could pad for the missing first log-probability.

Suggested change

```diff
-assert not prompt_lens.eq(0).any(), f"seq_offset - resp_len - 1 assumes prompt_len > 0. Got {prompt_lens}"
 response_list = []
 # Skip padding dimensions after sequence dimensions, if any.
 skip_padding = (0, 0) * (values.ndim - 1)
-for resp_len, seq_offset in zip(response_lens, sequence_offsets, strict=True):
+for i, (resp_len, seq_offset) in enumerate(zip(response_lens, sequence_offsets, strict=True)):
     pad_size = max_response_len - resp_len
-    # left-shift model output by one token for log_probs/values
-    response_list.append(F.pad(values[seq_offset - resp_len - 1 : seq_offset - 1], (*skip_padding, 0, pad_size)))
+    if resp_len == 0:
+        # Pad to full response length so entries stack with the others.
+        shape = (max_response_len,) + values.shape[1:]
+        response_list.append(torch.zeros(shape, dtype=values.dtype, device=values.device))
+        continue
+    if prompt_lens[i] > 0:
+        # left-shift model output by one token for log_probs/values
+        item = values[seq_offset - resp_len - 1 : seq_offset - 1]
+        response_list.append(F.pad(item, (*skip_padding, 0, pad_size)))
+    else:  # empty prompt
+        # The logprob for the first response token is not available.
+        item = values[seq_offset - resp_len : seq_offset - 1]
+        # Pad for the missing first logprob and for sequence-length alignment.
+        response_list.append(F.pad(item, (*skip_padding, 1, pad_size)))
```

Comment on lines +649 to +650

```python
if temp != 1.0:
    raise NotImplementedError("vLLM doesn't support temperature for prompt logprobs")
```

Severity: high

Raising a NotImplementedError if the temperature is not 1.0 can lead to unexpected runtime failures, especially since the teacher's temperature is often inherited from the student's configuration. A user might not be aware of this vLLM limitation.

To improve usability, consider logging a warning and overriding the temperature to 1.0 instead of raising an error. This would prevent crashes while still informing the user about the limitation.

Suggested change

```diff
 if temp != 1.0:
-    raise NotImplementedError("vLLM doesn't support temperature for prompt logprobs")
+    logger.warning("vLLM doesn't support temperature for prompt logprobs. Overriding to 1.0.")
+    temp = 1.0
```

@JacobHelwig JacobHelwig changed the title [vllm,trainer,algo] feat: Enable On-Policy Distillation for vLLM [vllm,trainer,algo] feat: Enable On-Policy Distillation for VLM Mar 14, 2026
@mergify bot commented Mar 14, 2026

⚠️ The sha of the head commit of this PR conflicts with #5164. Mergify cannot evaluate rules on this PR. Once #5164 is merged or closed, Mergify will resume processing this PR. ⚠️
