
feat: Vision Data Parallel for VLM training with Ulysses SP#1

Open
aoshen524 wants to merge 8 commits into main from feat/vision-dp-ulysses

Conversation

@aoshen524
Owner

Summary

  • Vision DP distributes whole images across Ulysses SP ranks for independent ViT computation
  • Single post-ViT all-gather collects embeddings back — zero ViT-internal communication
  • Supports Qwen2.5-VL and Qwen3-VL (including deepstack)
  • Gated behind vision_dp: bool = False in ModelArguments — opt-in, default behavior unchanged
  • Applied in both DeepSpeedInferStrategy and DeepSpeedTrainStrategy
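Distribution is driven by per-image patch counts derived from `grid_thw`. A minimal pure-Python sketch of the counting helpers (the names `get_image_patch_counts` and `spatial_merge_size` come from this PR's utilities; the `t*h*w` / merge-size arithmetic follows the usual Qwen2.5-VL convention and is an assumption here):

```python
# Sketch of per-image patch/embedding counting, assuming the standard
# Qwen2.5-VL convention: each image contributes t*h*w ViT patches, and
# spatial merging shrinks that by spatial_merge_size**2 per image.

def get_image_patch_counts(grid_thw):
    """grid_thw: list of (t, h, w) per image -> ViT patch count per image."""
    if not grid_thw:
        raise ValueError("grid_thw must be non-empty")
    return [t * h * w for (t, h, w) in grid_thw]

def get_image_embedding_counts(grid_thw, spatial_merge_size):
    """Post-merger embedding count per image (what the LLM actually sees)."""
    merge = spatial_merge_size ** 2
    return [c // merge for c in get_image_patch_counts(grid_thw)]

grid_thw = [(1, 16, 16), (1, 32, 24)]
print(get_image_patch_counts(grid_thw))         # [256, 768]
print(get_image_embedding_counts(grid_thw, 2))  # [64, 192]
```

These counts feed both the rank assignment and the post-ViT all-gather sizing.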

Key changes

| File | Change |
| --- | --- |
| `roll/utils/context_parallel/vision_dp.py` | Core utilities: assignment, slicing, all-gather with gradient support |
| `roll/configs/model_args.py` | Add `vision_dp: bool = False` to `ModelArguments` |
| `roll/distributed/strategy/deepspeed_strategy.py` | Gate `apply_vision_dp_patch()` behind the `vision_dp` flag in both infer and train strategies |

Usage

```
--vision_dp true --ulysses_size 2
```

Precision Alignment (verl reference experiment)

Validated in verl-project/verl#5230 under controlled conditions (same algorithm shared across frameworks):

| Scope | Params | max_diff | mean_diff | cosine_sim |
| --- | --- | --- | --- | --- |
| vision | 390 | 4.70e-05 | 2.93e-08 | 0.9991 |
| language | 338 | 9.50e-08 | 1.15e-10 | 1.0020 |
| other | 1 | 9.13e-08 | 2.25e-13 | 1.0001 |
  • Vision DP is effectively lossless: all differences fall within bf16 precision (~1e-05 max)
  • Language gradients are bitwise identical at the pre-clip phase

Test plan

  • Unit tests for Vision DP utilities
  • Multi-GPU distributed test with Ulysses SP enabled

🤖 Generated with Claude Code

aoshen524 and others added 8 commits February 26, 2026 18:24
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation,
reducing ViT peak memory by roughly a factor of sp_size (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities,
  GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL,
  Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…issues

Address reviewer comments (same fixes as verl PR #5230 and AReaL PR #929):

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with
   `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate
   partial sequence gradients before slicing. Fixes silent gradient loss
   when vision tokens span multiple sequence shard boundaries.

2. **Load-balanced assignment**: Replace count-based chunking with greedy
   contiguous bin-packing that balances total patch load across ranks.

3. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from
   caller instead of doing all_gather in forward.

4. **Idempotency guard**: Extract `_patch_vision_class()` helper with
   `_vision_dp_patched` attribute check. Add `_unapply_vision_class()` to
   properly clear the flag on unapply.

5. **Remove Qwen3-VL-MoE dead code**: Remove unreachable qwen3_vl_moe
   blocks from apply/unapply (not yet in transformers vl_model_mappings).

6. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to dp_vision_forward
   entry point to avoid repeated `.tolist()` GPU→CPU syncs.

7. **Tensor slicing**: Replace Python loop + list append in
   prepare_local_vision_inputs with contiguous tensor slice using cumsum.

8. **Test improvements**: Rename tests, add load balancing test, add
   gather_none_group test, use parametrize.
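The load-balanced assignment in fix 2 can be illustrated in pure Python. This is a sketch of the idea under stated assumptions, not the PR's exact heuristic: walk the images in order, and start filling the next rank once the current one reaches the ideal per-rank share of total patches.

```python
# Illustrative greedy contiguous bin-packing: whole images stay on one rank,
# ranks receive contiguous runs of images, and total patch load is roughly
# balanced. A sketch only -- the heuristic in vision_dp.py may differ
# (e.g. it may also handle more ranks than images via empty ranks).

def assign_images_to_dp_ranks(patch_counts, dp_size):
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    if not patch_counts:
        raise ValueError("patch_counts must be non-empty")
    target = sum(patch_counts) / dp_size       # ideal per-rank patch load
    assignments = [[] for _ in range(dp_size)]
    rank, load = 0, 0
    for img_idx, count in enumerate(patch_counts):
        if rank < dp_size - 1 and load >= target:
            rank += 1                          # current rank is full enough
            load = 0
        assignments[rank].append(img_idx)      # whole image on one rank
        load += count
    return assignments

# A heavy first image gets a rank to itself:
print(assign_images_to_dp_ranks([300, 100, 100, 100], dp_size=2))  # [[0], [1, 2, 3]]
```

Contiguity matters here: because each rank's images form one contiguous run, the per-rank inputs can later be taken as a single tensor slice rather than gathered piecemeal.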

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() for empty rank to participate in backward all_reduce
- Add .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in spatial_merge_size==1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the isinstance-tuple check with model-attribute detection
(`hasattr` on `deepstack_merger_list`). Empty ranks now create matching
empty deepstack tensors and participate in the all-gather, preventing an
NCCL deadlock when num_images < dp_size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `vision_dp: bool = False` to ModelArguments and gate
apply_vision_dp_patch() calls in both DeepSpeedInferStrategy and
DeepSpeedTrainStrategy behind it. Vision DP is now opt-in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace the `expected_patches = end_patch - start_patch` check (always
  true by construction under Python slicing) with an independent cross-check
  via `get_image_patch_counts(local_grid_thw)` in prepare_local_vision_inputs()
- Rename tests to `test_<what>_<condition>_<expected>()` convention
- Add missing tests: embedding_counts empty, contiguous coverage,
  gather same-storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sync shared utility functions with verl's stricter error handling:

- get_image_patch_counts/get_image_embedding_counts: empty grid_thw
  raises ValueError instead of returning []
- assign_images_to_dp_ranks: validate dp_size > 0, empty patch_counts
  raises ValueError instead of returning empty lists
- prepare_local_vision_inputs: add dp_rank bounds check, use tensor-ops
  for offset computation (avoid Python-list round-trip), add int() cast
- GatherVisionEmbeddings.forward: dp_size<=1 raises RuntimeError,
  validate all_counts length, max_count==0 raises RuntimeError
- GatherVisionEmbeddings.backward: assert dp_size>1, add CUDA check
- dp_vision_forward: sp_size<=1 raises RuntimeError, use
  GatherVisionEmbeddings.apply() directly, add detailed assert messages
- Update tests to match: empty→raises, add dp_size/dp_rank validation
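The tensor-op offset computation mentioned above amounts to a prefix sum over patch counts: since each rank's images are contiguous, its patch range is a single `[start, end)` slice. A pure-Python sketch (a stand-in for the PR's tensor version; `local_patch_range` is a hypothetical name, and the cross-check mirrors the one added to `prepare_local_vision_inputs`):

```python
# Sketch: derive a rank's contiguous patch slice from per-image patch counts
# via prefix sums. Stand-in for a torch.cumsum-based implementation.
from itertools import accumulate

def local_patch_range(patch_counts, assignments, dp_rank):
    if not (0 <= dp_rank < len(assignments)):
        raise IndexError(f"dp_rank {dp_rank} out of range")
    offsets = [0] + list(accumulate(patch_counts))  # prefix sums of counts
    imgs = assignments[dp_rank]
    if not imgs:
        return (0, 0)                               # empty rank: zero-width slice
    start, end = offsets[imgs[0]], offsets[imgs[-1] + 1]
    # Independent cross-check: slice width must equal the sum of this
    # rank's own per-image patch counts.
    assert end - start == sum(patch_counts[i] for i in imgs)
    return start, end

counts = [300, 100, 100, 100]
print(local_patch_range(counts, [[0], [1, 2, 3]], dp_rank=1))  # (300, 600)
```

This avoids the Python-list round-trip: one cumulative sum and two indexed reads replace a per-image loop with list appends.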

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Call apply_vision_dp_patch() in fsdp2_strategy.py after set_upg_manager(),
mirroring the existing pattern in deepspeed_strategy.py. This ensures
Vision DP works correctly with FSDP2, not just DeepSpeed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>