feat: Vision Data Parallel for VLM training with CP#1

Open
aoshen524 wants to merge 51 commits into `main` from `feat/vision-dp`

Conversation

@aoshen524

Summary

  • Vision DP distributes whole images across CP ranks for independent ViT computation, replacing redundant ViT execution on every rank
  • Single post-ViT all-gather collects embeddings back — zero ViT-internal communication
  • Supports Qwen2.5-VL and Qwen3-VL (including deepstack)
  • Gated behind vision_dp: bool = False in FSDPArgs — opt-in, default behavior unchanged
  • ViT gradient sync via sync_vision_grads_across_cp also gated behind the flag
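The distribution-then-gather scheme above can be simulated with a toy sketch (plain Python, no distributed code; `fake_vit` is a hypothetical stand-in for the real ViT):

```python
def fake_vit(images):
    """Stand-in for the ViT: one embedding per image (toy, not model code)."""
    return [f"emb({i})" for i in images]

images = list(range(6))
cp_size = 3
chunk = len(images) // cp_size
# Image-level contiguous slices, one per CP rank; each rank runs the ViT
# only on its own slice.
per_rank = [fake_vit(images[r * chunk:(r + 1) * chunk]) for r in range(cp_size)]
# Post-ViT all-gather: concatenating rank outputs in rank order restores
# the original image order, so the ViT needs no internal communication.
gathered = [emb for part in per_rank for emb in part]
assert gathered == fake_vit(images)
```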

Key changes

| File | Change |
| --- | --- |
| `miles/utils/vision_dp.py` | Core utilities: assignment, slicing, all-gather with gradient support |
| `miles/backends/fsdp_utils/actor.py` | Gate Vision DP patch + grad sync behind `self.args.vision_dp` |
| `miles/backends/fsdp_utils/arguments.py` | Add `vision_dp: bool = False` to `FSDPArgs` |

Usage

```python
# In FSDPArgs
vision_dp = True
context_parallel_size = 2
```

Precision Alignment (verl reference experiment)

Validated in verl-project/verl#5230 under controlled conditions (same algorithm shared across frameworks):

| Scope | Params | max_diff | mean_diff | cosine_sim |
| --- | --- | --- | --- | --- |
| vision | 390 | 4.70e-05 | 2.93e-08 | 0.9991 |
| language | 338 | 9.50e-08 | 1.15e-10 | 1.0020 |
| other | 1 | 9.13e-08 | 2.25e-13 | 1.0001 |
  • Vision DP is numerically lossless: all differences within bf16 precision (~1e-05 max)
  • Language gradients are bitwise identical at pre-clip phase
  • Root cause: all_reduce(SUM) changes FP accumulation order in vision backward
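The accumulation-order effect is easy to reproduce with plain floats (a minimal illustration of non-associative FP addition, not the actual kernel behavior):

```python
# Floating-point addition is not associative, so reordering the terms of a
# reduction (as all_reduce(SUM) may do) can change the low-order bits.
a, b, c = 1e16, 1.0, -1e16

left_to_right = (a + b) + c  # 1.0 is absorbed into 1e16, then cancelled -> 0.0
reordered = (a + c) + b      # the large terms cancel first, 1.0 survives -> 1.0

print(left_to_right, reordered)
```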

Test plan

  • Unit tests for Vision DP utilities
  • Multi-GPU distributed test with CP enabled

🤖 Generated with Claude Code

aoshen524 and others added 30 commits February 16, 2026 01:14
When using Context Parallelism (cp_size > 1), Ring Flash Attention splits
text attention across CP ranks, but the VisionTransformer (ViT) still
processes ALL images on every rank, making ViT memory the bottleneck for
multi-turn VLM training with many screenshots.

Vision DP distributes whole images (not patches) across CP ranks:
- Before: Each of N CP ranks processes ALL images -> O(total_images)
- After: Each rank processes total_images/N images -> O(total_images/N)

Key design:
- Image-level contiguous distribution (no reordering after all-gather)
- Gradient scaling by cp_size to compensate for FSDP reduction
- Supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-VL-MoE

Adapted from verl PR #5230.
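The per-rank load change can be sketched with a hypothetical helper (illustrative, not the PR's code):

```python
def images_per_rank(total_images: int, cp_size: int) -> list[int]:
    """Contiguous split of whole images across CP ranks.

    Before Vision DP every rank processed all `total_images`; with it,
    rank r processes roughly total_images / cp_size images.
    """
    base, rem = divmod(total_images, cp_size)
    # The first `rem` ranks take one extra image each.
    return [base + (1 if r < rem else 0) for r in range(cp_size)]

print(images_per_rank(10, 4))  # [3, 3, 2, 2]
```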

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review comments from Gemini:

1. (Critical) Add sync_vision_grads_across_cp() to all-reduce AVG the
   vision tower parameter gradients across CP ranks after backward.
   Without this, FSDP only reduces across dp_mesh, causing ViT weights
   to diverge when Vision DP produces different gradients per CP rank.

2. (Medium) Replace print() with logger.info() in apply_vision_dp_patch.

Gradient math: GatherVisionEmbeddings backward scales output grads by
cp_size, so ViT param grads = cp_size * partial_grad. After AVG across
CP: mean(cp_size * partial_k) = total_grad. Correct.
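The identity in this commit message checks out numerically (plain Python with hypothetical gradient values):

```python
cp_size = 4
# Partial ViT parameter gradients produced by each CP rank's backward.
partials = [0.1, 0.4, 0.2, 0.3]
total_grad = sum(partials)

# GatherVisionEmbeddings.backward scales each rank's grad by cp_size;
# all_reduce(AVG) across CP then divides the sum by cp_size, so the
# scaling cancels and the true total gradient is recovered.
scaled = [cp_size * g for g in partials]
avg_across_cp = sum(scaled) / cp_size

assert abs(avg_across_cp - total_grad) < 1e-12
```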

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
…sample fields (radixark#548)

Co-authored-by: miles-code-angel <miles.pr.bot@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
…issues

Address reviewer comments (same fixes as verl PR #5230):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with
   `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate
   partial sequence gradients before slicing. Fixes silent gradient loss.

2. **sync_vision_grads_across_cp: AVG→SUM**: With the activation gradient
   fix, each rank's ViT backward produces partial (not scaled) param
   gradients. SUM (not AVG) across CP now correctly recovers the total.

3. **Load-balanced assignment**: Replace count-based chunking with greedy
   contiguous bin-packing that balances total patch load across ranks.

4. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from
   caller instead of doing all_gather in forward.

5. **Idempotency guard**: Add `_vision_dp_patched` attribute check in
   apply_vision_dp_patch to prevent double-wrapping.

6. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe
   block from apply_vision_dp_patch.

7. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to dp_vision_forward
   entry point.

8. **Tensor slicing**: Replace Python loop in prepare_local_vision_inputs
   with contiguous tensor slice using cumsum.

9. **Test improvements**: Rename tests, add load balancing test, add
   gather_none_group test, use parametrize.
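Items 3 and 8 above can be sketched together in plain Python (a hypothetical re-implementation of the ideas, not the PR's tensor code; assumes at least one image per rank):

```python
from itertools import accumulate

def assign_contiguous(patch_counts, dp_size):
    """Greedy contiguous bin-packing sketch: keep image order, close a bin
    once it reaches the ideal per-rank patch share, but always leave at
    least one image for every remaining rank."""
    target = sum(patch_counts) / dp_size
    bins, start, load = [], 0, 0
    for i, count in enumerate(patch_counts):
        load += count
        ranks_left = dp_size - len(bins) - 1
        images_left = len(patch_counts) - (i + 1)
        if ranks_left > 0 and images_left >= ranks_left and (
            load >= target or images_left == ranks_left
        ):
            bins.append(range(start, i + 1))
            start, load = i + 1, 0
    bins.append(range(start, len(patch_counts)))
    return bins

def patch_slice(patch_counts, images):
    """Cumsum-based (start, end) patch offsets for one rank's contiguous
    image range, mirroring the single-tensor-slice change of item 8."""
    cum = [0, *accumulate(patch_counts)]
    return cum[images.start], cum[images.stop]

counts = [8, 2, 2, 4]                    # patches per image
print(assign_contiguous(counts, 2))      # [range(0, 1), range(1, 4)]
print(patch_slice(counts, range(1, 4)))  # (8, 16)
```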

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: gongyisheng <yishenggong9437@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
aoshen524 and others added 21 commits March 3, 2026 22:47
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() for empty rank to participate in backward all_reduce
- Add .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in spatial_merge_size==1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect Qwen3-VL via model attribute (hasattr deepstack_merger_list)
instead of return type, so empty ranks that skip original_forward
still create matching empty deepstack tensors and participate in
all-gather — preventing NCCL deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to FSDPArgs and gate both
apply_vision_dp_patch() and sync_vision_grads_across_cp() behind it.
Vision DP is now opt-in rather than auto-enabled when CP > 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…adixark#642)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Yueming Yuan <yym022502@gmail.com>
- Replace `expected_patches = end_patch - start_patch` (always-true by
  Python slicing) with independent cross-check via
  `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage,
  gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw
  raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0, empty patch_counts
  raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add dp_rank bounds check, use tensor-ops
  for offset computation (avoid Python-list round-trip), add int() cast
- GatherVisionEmbeddings.forward: dp_size<=1 raises RuntimeError,
  validate all_counts length, max_count==0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size>1, add CUDA check
- dp_vision_forward: cp_size<=1 raises RuntimeError, use
  GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ixark#656)

Co-authored-by: maocheng23 <35615230+maocheng23@users.noreply.github.com>
…xark#643)

Co-authored-by: Yueming Yuan <yym022502@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Key changes from AReaL PR #929:
- Private `_` prefix on internal functions (cleaner public API)
- Simplified error handling: return empty instead of raising for edge cases
- Extract `_unpack_deepstack()` helper (was inline in dp_vision_forward)
- `_patch_vision_class()` as standalone function with `_VISION_CLASSES` registry
- `importlib`-based patching replaces repeated try/except import blocks
- Simplified `spatial_merge_size` lookup: `getattr(self, "spatial_merge_size", 1)`
- Hidden size fallback: `out_hidden_size` or `hidden_size`
- `.contiguous()` defensive guard in GatherVisionEmbeddings forward
- `dp_size==1` short-circuit in GatherVisionEmbeddings (instead of raise)
- `cp_size<=1` falls through to original_forward (instead of raise)
- Remove cross-check assertion in _prepare_local_vision_inputs
- Remove CUDA device check in backward (handled by NCCL)
- Use `_gather_vision_embeddings` wrapper consistently (including deepstack)
- Pass CPU grid_thw to _prepare_local_vision_inputs, move back to GPU after

Miles-specific (kept from original):
- Closure-based CP group passing (no Ulysses SP APIs)
- `sync_vision_grads_across_cp()` for explicit ViT param grad sync
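The `importlib`-based registry patching with an idempotency guard can be sketched as follows (hypothetical names and a stand-in module, not the PR's `_VISION_CLASSES` contents):

```python
import importlib
import sys
import types

def patch_vision_classes(registry, make_forward):
    """Patch each registered vision class's forward exactly once."""
    for module_path, class_names in registry.items():
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # model family not installed: skip, no per-model try/except
        for name in class_names:
            cls = getattr(module, name, None)
            if cls is None or getattr(cls, "_vision_dp_patched", False):
                continue  # idempotency guard against double-wrapping
            cls.forward = make_forward(cls.forward)
            cls._vision_dp_patched = True

# Demo with a fake module standing in for a transformers model file.
fake = types.ModuleType("fake_vlm")
class FakeViT:
    def forward(self):
        return "original"
fake.FakeViT = FakeViT
sys.modules["fake_vlm"] = fake

wrap = lambda orig: (lambda self: "dp:" + orig(self))
patch_vision_classes({"fake_vlm": ["FakeViT"]}, wrap)
patch_vision_classes({"fake_vlm": ["FakeViT"]}, wrap)  # second call is a no-op
print(FakeViT().forward())  # dp:original
```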

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both Glm4vVisionModel and Glm4vMoeVisionModel (GLM-5 744B) share the
same forward signature as Qwen series (hidden_states, grid_thw -> Tensor),
so no changes needed to create_dp_vision_forward — just register them.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
