
feat: Vision Data Parallel for VLM training with Ulysses SP#1

Open
aoshen524 wants to merge 8 commits into main from feat/vision-dp-ulysses

Conversation

@aoshen524
Owner

Summary

  • Vision DP distributes whole images across Ulysses SP ranks for independent ViT computation
  • Single post-ViT all-gather collects embeddings back — zero ViT-internal communication
  • Supports Qwen2.5-VL and Qwen3-VL (including deepstack)
  • Gated behind vision_dp: bool = False in ModelArguments — opt-in, default behavior unchanged
  • Applied in both DeepSpeedInferStrategy and DeepSpeedTrainStrategy
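Distribution is driven by per-image patch counts derived from `grid_thw`. A minimal pure-Python sketch of the counting helpers (the names `get_image_patch_counts` and `spatial_merge_size` come from this PR's utilities; the `t*h*w` / merge-size arithmetic follows the usual Qwen2.5-VL convention and is an assumption here):

```python
# Sketch of per-image patch/embedding counting, assuming the standard
# Qwen2.5-VL convention: each image contributes t*h*w ViT patches, and
# spatial merging shrinks that by spatial_merge_size**2 per image.

def get_image_patch_counts(grid_thw):
    """grid_thw: list of (t, h, w) per image -> ViT patch count per image."""
    if not grid_thw:
        raise ValueError("grid_thw must be non-empty")
    return [t * h * w for (t, h, w) in grid_thw]

def get_image_embedding_counts(grid_thw, spatial_merge_size):
    """Post-merger embedding count per image (what the LLM actually sees)."""
    merge = spatial_merge_size ** 2
    return [c // merge for c in get_image_patch_counts(grid_thw)]

grid_thw = [(1, 16, 16), (1, 32, 24)]
print(get_image_patch_counts(grid_thw))         # [256, 768]
print(get_image_embedding_counts(grid_thw, 2))  # [64, 192]
```

These counts feed both the rank assignment and the post-ViT all-gather sizing.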

Key changes

| File | Change |
| --- | --- |
| `roll/utils/context_parallel/vision_dp.py` | Core utilities: assignment, slicing, all-gather with gradient support |
| `roll/configs/model_args.py` | Add `vision_dp: bool = False` to `ModelArguments` |
| `roll/distributed/strategy/deepspeed_strategy.py` | Gate `apply_vision_dp_patch()` behind the `vision_dp` flag in both infer and train strategies |

Usage

```
--vision_dp true --ulysses_size 2
```

Precision Alignment (verl reference experiment)

Validated in verl-project/verl#5230 under controlled conditions (same algorithm shared across frameworks):

| Scope | Params | max_diff | mean_diff | cosine_sim |
| --- | --- | --- | --- | --- |
| vision | 390 | 4.70e-05 | 2.93e-08 | 0.9991 |
| language | 338 | 9.50e-08 | 1.15e-10 | 1.0020 |
| other | 1 | 9.13e-08 | 2.25e-13 | 1.0001 |
  • Vision DP is effectively lossless: all differences fall within bf16 precision (~1e-05 max)
  • Language gradients are bitwise identical at the pre-clip phase

Test plan

  • Unit tests for Vision DP utilities
  • Multi-GPU distributed test with Ulysses SP enabled

🤖 Generated with Claude Code

aoshen524 and others added 8 commits February 26, 2026 18:24
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation,
reducing ViT peak memory by roughly a factor of sp_size (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities,
  GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL,
  Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…issues

Address reviewer comments (same fixes as verl PR #5230 and AReaL PR #929):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with
   `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate
   partial sequence gradients before slicing. Fixes silent gradient loss
   when vision tokens span multiple sequence shard boundaries.

2. **Load-balanced assignment**: Replace count-based chunking with greedy
   contiguous bin-packing that balances total patch load across ranks.

3. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from
   caller instead of doing all_gather in forward.

4. **Idempotency guard**: Extract `_patch_vision_class()` helper with
   `_vision_dp_patched` attribute check. Add `_unapply_vision_class()` to
   properly clear the flag on unapply.

5. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe
   blocks from apply/unapply (not yet in transformers vl_model_mappings).

6. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to dp_vision_forward
   entry point to avoid repeated `.tolist()` GPU→CPU syncs.

7. **Tensor slicing**: Replace Python loop + list append in
   prepare_local_vision_inputs with contiguous tensor slice using cumsum.

8. **Test improvements**: Rename tests, add load balancing test, add
   gather_none_group test, use parametrize.
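The load-balanced assignment in fix 2 can be illustrated in pure Python. This is a sketch of the idea under stated assumptions, not the PR's exact heuristic: walk the images in order, and start filling the next rank once the current one reaches the ideal per-rank share of total patches.

```python
# Illustrative greedy contiguous bin-packing: whole images stay on one rank,
# ranks receive contiguous runs of images, and total patch load is roughly
# balanced. A sketch only -- the heuristic in vision_dp.py may differ
# (e.g. it may also handle more ranks than images via empty ranks).

def assign_images_to_dp_ranks(patch_counts, dp_size):
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    if not patch_counts:
        raise ValueError("patch_counts must be non-empty")
    target = sum(patch_counts) / dp_size       # ideal per-rank patch load
    assignments = [[] for _ in range(dp_size)]
    rank, load = 0, 0
    for img_idx, count in enumerate(patch_counts):
        if rank < dp_size - 1 and load >= target:
            rank += 1                          # current rank is full enough
            load = 0
        assignments[rank].append(img_idx)      # whole image on one rank
        load += count
    return assignments

# A heavy first image gets a rank to itself:
print(assign_images_to_dp_ranks([300, 100, 100, 100], dp_size=2))  # [[0], [1, 2, 3]]
```

Contiguity matters here: because each rank's images form one contiguous run, the per-rank inputs can later be taken as a single tensor slice rather than gathered piecemeal.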

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() for empty rank to participate in backward all_reduce
- Add .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in spatial_merge_size==1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the isinstance-tuple check with model-attribute detection
(`hasattr` on `deepstack_merger_list`). Empty ranks now create matching
empty deepstack tensors and participate in the all-gather, preventing an
NCCL deadlock when num_images < dp_size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to ModelArguments and gate
apply_vision_dp_patch() calls in both DeepSpeedInferStrategy and
DeepSpeedTrainStrategy behind it. Vision DP is now opt-in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace the `expected_patches = end_patch - start_patch` check (always
  true by construction under Python slicing) with an independent cross-check
  via `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage,
  gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw
  raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0, empty patch_counts
  raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add dp_rank bounds check, use tensor-ops
  for offset computation (avoid Python-list round-trip), add int() cast
- GatherVisionEmbeddings.forward: dp_size<=1 raises RuntimeError,
  validate all_counts length, max_count==0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size>1, add CUDA check
- dp_vision_forward: sp_size<=1 raises RuntimeError, use
  GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation
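The tensor-op offset computation mentioned above amounts to a prefix sum over patch counts: since each rank's images are contiguous, its patch range is a single `[start, end)` slice. A pure-Python sketch (a stand-in for the PR's tensor version; `local_patch_range` is a hypothetical name, and the cross-check mirrors the one added to `prepare_local_vision_inputs`):

```python
# Sketch: derive a rank's contiguous patch slice from per-image patch counts
# via prefix sums. Stand-in for a torch.cumsum-based implementation.
from itertools import accumulate

def local_patch_range(patch_counts, assignments, dp_rank):
    if not (0 <= dp_rank < len(assignments)):
        raise IndexError(f"dp_rank {dp_rank} out of range")
    offsets = [0] + list(accumulate(patch_counts))  # prefix sums of counts
    imgs = assignments[dp_rank]
    if not imgs:
        return (0, 0)                               # empty rank: zero-width slice
    start, end = offsets[imgs[0]], offsets[imgs[-1] + 1]
    # Independent cross-check: slice width must equal the sum of this
    # rank's own per-image patch counts.
    assert end - start == sum(patch_counts[i] for i in imgs)
    return start, end

counts = [300, 100, 100, 100]
print(local_patch_range(counts, [[0], [1, 2, 3]], dp_rank=1))  # (300, 600)
```

This avoids the Python-list round-trip: one cumulative sum and two indexed reads replace a per-image loop with list appends.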

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Call apply_vision_dp_patch() in fsdp2_strategy.py after set_upg_manager(),
mirroring the existing pattern in deepspeed_strategy.py. This ensures
Vision DP works correctly with FSDP2, not just DeepSpeed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>