feat(vision): add Vision DP for parallel ViT computation across SP ranks#5230

Open
aoshen524 wants to merge 11 commits into verl-project:main from aoshen524:feat/vision-dp-parallel

Conversation


@aoshen524 aoshen524 commented Feb 7, 2026

Vision Data Parallel: Distribute ViT computation across Ulysses SP ranks

Motivation

When using Ulysses Sequence Parallelism (sp_size > 1), attention is split across SP ranks via all-to-all, but the VisionTransformer (ViT) still processes ALL images on every rank, so ViT activation memory stays proportional to the total image count.

Vision DP fixes this by distributing whole images (not patches) across SP ranks:

  • Before: Each of N SP ranks processes ALL images → ViT memory = O(total_images)
  • After: Each rank processes total_images/N images → ViT memory = O(total_images/N)

Design

  1. Image-level distribution: Avoids breaking ViT's internal cu_seqlens tracking
  2. Contiguous assignment: rank 0 gets images [0,1,...], rank 1 gets next chunk → no reordering after all-gather
  3. Load-balanced bin-packing: Greedy contiguous assignment that balances total patch load (not just image count) across ranks
  4. Correct gradient routing: all_reduce(SUM) in backward aggregates partial sequence gradients from all SP ranks before slicing by image assignment
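The load-balanced contiguous assignment in (3) can be sketched as a greedy split over cumulative patch counts (a simplified illustration of the idea, not the exact code in `verl/utils/vision_dp.py`; the function name is made up):

```python
def assign_images_contiguous(patch_counts, dp_size):
    """Split images into dp_size contiguous runs, cutting each boundary as
    soon as the cumulative patch load reaches that rank's ideal share.
    Returns (start, end) image-index pairs, one per rank."""
    n, total = len(patch_counts), sum(patch_counts)
    bounds, cum, idx = [0], 0, 0
    for rank in range(1, dp_size):
        target = total * rank / dp_size  # ideal cumulative load at this cut
        while idx < n and cum < target:
            cum += patch_counts[idx]
            idx += 1
        bounds.append(idx)
    bounds.append(n)
    return [(bounds[r], bounds[r + 1]) for r in range(dp_size)]
```

Contiguity is what makes the post-ViT all-gather a plain concatenation: rank r's embeddings land exactly at the offset given by the patch counts before its first image, so no reordering is needed.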

Key changes

| File | Change |
| --- | --- |
| `verl/utils/vision_dp.py` | Core utilities: assignment, slicing, all-gather with gradient support |
| `verl/models/transformers/monkey_patch.py` | Integration: patch VisionTransformer classes with idempotency guard |
| `tests/utils/test_vision_dp_on_cpu.py` | CPU-only unit tests (21 tests, no distributed required) |
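The "all-gather with gradient support" entry is the heart of the design: embeddings computed per rank are gathered into the full sequence in forward, while backward all-reduces the full-sequence gradient and slices out each rank's share. A minimal sketch (an assumed shape of the real `GatherVisionEmbeddings` in `verl/utils/vision_dp.py`; the actual code also handles deepstack outputs, empty ranks, and uneven gather sizes):

```python
import torch
import torch.distributed as dist


class GatherVisionEmbeddings(torch.autograd.Function):
    """Forward: all-gather per-rank vision embeddings and concatenate.
    Backward: all_reduce(SUM) the full-sequence gradient, then slice the
    chunk matching this rank's image assignment."""

    @staticmethod
    def forward(ctx, local_embeds, all_counts, group=None):
        rank = dist.get_rank(group)
        ctx.group = group
        ctx.start = sum(all_counts[:rank])
        ctx.end = ctx.start + all_counts[rank]
        # Note: plain all_gather assumes equal sizes across ranks; a real
        # implementation must pad or pre-size buffers for uneven counts.
        gathered = [local_embeds.new_empty(c, local_embeds.shape[-1]) for c in all_counts]
        dist.all_gather(gathered, local_embeds.detach().contiguous(), group=group)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad = grad_output.contiguous()
        # Aggregate partial sequence gradients from all SP ranks before slicing.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ctx.group)
        return grad[ctx.start:ctx.end], None, None
```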

Tests

```bash
python -m pytest tests/utils/test_vision_dp_on_cpu.py -v
# 21 passed
```

Precision Alignment: Vision DP On vs Off

Experiment Setup

| Config | Value |
| --- | --- |
| Model | Qwen2.5-VL-7B (SFT checkpoint) |
| GPUs | 4 × H100 (1 node) |
| SP size | 2 |
| Strategy | FSDP2 |
| Train batch size | 4 |
| Rollout N | 2 |
| Max assistant turns | 4 |
| Dummy mode | full (fixed multi-image prompts with meaningful screenshots + fixed text) |
| Determinism | enabled (seed=42, TF32 disabled, CUBLAS_WORKSPACE_CONFIG=:4096:8) |
| Advantage | hardcoded to 1.0 to remove reward randomness |
| Training steps | 1 |

Why fix advantage? In RL training, reward randomness from rollouts causes gradient divergence unrelated to the optimization under test. Hardcoding advantage isolates the effect of Vision DP on gradient computation.

Comparison Method

  1. Online: Both runs dump per-parameter gradient tensors (as .pt files) at the optimizer step via VERL_PRECISION_DUMP_GRADS env var. Each run auto-creates a UUID subdirectory.
  2. Online: Per-group SHA256 hash fingerprints are logged at 3 phases: pre_clip (raw grad), before (post-clip), after (post-step), along with L2 norm (fp64) and absmax scalars.
  3. Offline: tools/compare_grads.py loads the dumped tensors and computes per-parameter max(|A-B|), mean(|A-B|), and cosine similarity, aggregated by scope (vision/language/other).
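The offline per-parameter metrics can be reproduced with a small standalone function (a plain-Python sketch of what `tools/compare_grads.py` computes; the real tool operates on torch tensors loaded from the `.pt` dumps, and the scope rule shown here is an assumption):

```python
import math


def grad_diff_stats(a, b):
    """Per-parameter comparison of two flattened gradient dumps:
    max |A-B|, mean |A-B|, and cosine similarity."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "cosine_sim": dot / (norm_a * norm_b),
    }


def scope_of(param_name):
    # Aggregation buckets used in the result tables: vision / language / other.
    if ".visual." in param_name or param_name.startswith("visual"):
        return "vision"
    if param_name.startswith("model."):
        return "language"
    return "other"
```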

Results: Hash-Level (62 groups × 3 phases)

| Phase | Param SHA256 | Grad SHA256 Match | Vision Grad MM | Language Grad MM |
| --- | --- | --- | --- | --- |
| pre_clip (raw grad) | 62/62 ✅ | 29/62 | 32 | 0 |
| before (post-clip) | 62/62 ✅ | 0/62 | 32 | 29 |
| after (post-step) | 0/62 | 0/62 | 32 | 29 |

Language gradients are bitwise identical at pre_clip — Vision DP does not affect the language model's gradient computation. The before/after phase divergence is expected: global grad norm (used for clipping) includes vision gradients, so a different clip factor propagates to all parameters.
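The coupling through clipping can be made concrete: the clip coefficient is a single scalar derived from the global norm, so any change in the vision contribution shifts every gradient (a schematic of torch-style `clip_grad_norm_`, not verl's exact code):

```python
import math


def global_grad_norm(group_norms):
    # Global L2 norm over all parameter groups (vision + language + other).
    return math.sqrt(sum(n * n for n in group_norms))


def clip_coef(global_norm, max_norm=1.0):
    # Every gradient is scaled by the SAME coefficient, so a perturbed
    # vision norm changes post-clip language gradients too.
    return min(1.0, max_norm / (global_norm + 1e-6))
```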

Results: Per-Parameter Element-wise Diff

| Scope | Params | max_diff | mean_diff | cosine_sim |
| --- | --- | --- | --- | --- |
| vision | 390 | 4.70e-05 | 2.93e-08 | 0.9991 |
| language | 338 | 9.50e-08 | 1.15e-10 | 1.0020 |
| other | 1 | 9.13e-08 | 2.25e-13 | 1.0001 |

Top-5 worst parameters (all vision scope):

| Parameter | max_diff |
| --- | --- |
| model.visual.patch_embed.proj.weight | 4.70e-05 |
| model.visual.blocks.7.attn.qkv.weight | 1.99e-05 |
| model.visual.blocks.1.mlp.gate_proj.weight | 1.59e-05 |
| model.visual.blocks.0.mlp.gate_proj.weight | 1.18e-05 |
| model.visual.blocks.1.norm1.weight | 1.13e-05 |

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Vision Data Parallelism (DP) to enable parallel ViT computation by distributing images across sequence parallel ranks. The implementation is well-structured, with new utilities in verl/utils/vision_dp.py and corresponding monkey-patching logic. The addition of comprehensive unit tests in tests/test_vision_dp.py is commendable.

My review focuses on improving code quality, robustness, and maintainability. I've identified a few areas with high-severity issues:

  • Redundant computation in prepare_local_vision_inputs.
  • Fragile logic for determining hidden_size in the DP wrapper, which could lead to runtime errors with different models.
  • Significant code duplication in monkey_patch.py that should be refactored.

Addressing these points will make the new functionality more robust and easier to maintain in the future.

Comment on lines +407 to +437
```python
if ulysses_sp_size > 1:
    from verl.utils.vision_dp import create_dp_vision_forward

    # Patch Qwen2-VL VisionTransformer
    try:
        from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VisionTransformerPretrainedModel

        original_vision_forward = Qwen2VisionTransformerPretrainedModel.forward
        Qwen2VisionTransformerPretrainedModel.forward = create_dp_vision_forward(original_vision_forward)
        print(
            f"Monkey patch Qwen2VisionTransformerPretrainedModel.forward"
            f" for Vision DP (dp_size={ulysses_sp_size})"
        )
    except ImportError as e:
        print(f"Warning: Could not patch Qwen2VisionTransformer for Vision DP: {e}")

    # Patch Qwen2.5-VL VisionTransformer (uses a different class)
    try:
        from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
            Qwen2_5_VisionTransformerPretrainedModel,
        )

        original_vision_forward_25 = Qwen2_5_VisionTransformerPretrainedModel.forward
        Qwen2_5_VisionTransformerPretrainedModel.forward = create_dp_vision_forward(original_vision_forward_25)
        print(
            f"Monkey patch Qwen2_5_VisionTransformerPretrainedModel.forward"
            f" for Vision DP (dp_size={ulysses_sp_size})"
        )
    except ImportError as e:
        print(f"Warning: Could not patch Qwen2_5VisionTransformer for Vision DP: {e}")
```


high

This block of code for monkey-patching the vision transformer is repeated multiple times in this file (here for Qwen2/2.5-VL, and again for Qwen3 models). This duplication makes the code harder to read and maintain. Please consider refactoring this logic into a helper function that takes the module path and class name as arguments. This would significantly reduce code duplication and improve maintainability.

For example, a helper could look like this:

```python
def _patch_vision_model_for_dp(module_path, class_name, ulysses_sp_size):
    try:
        module = __import__(module_path, fromlist=[class_name])
        model_class = getattr(module, class_name)
        original_forward = model_class.forward
        model_class.forward = create_dp_vision_forward(original_forward)
        print(f"Monkey patch {class_name}.forward for Vision DP (dp_size={ulysses_sp_size})")
    except (ImportError, AttributeError) as e:
        print(f"Warning: Could not patch {class_name} for Vision DP: {e}")
```

Additionally, the step numbering is inconsistent ("Step 4" here, but "Step 3" for the Qwen3 models). A refactor would help resolve such inconsistencies.

```python
    )

    # Compute patch offsets for each image
    patch_counts = (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()
```

high

The patch_counts are recalculated here, but they have already been computed in the calling function create_dp_vision_forward. This leads to redundant computation. You can improve efficiency and code clarity by passing patch_counts as an argument to prepare_local_vision_inputs and removing this line.

Comment on lines +387 to +406
```python
# This rank has no images, create empty tensor with correct hidden size
# Try multiple common attribute paths for hidden size detection
if hasattr(self, "merger") and hasattr(self.merger, "ln_q"):
    ln_q = self.merger.ln_q
    if hasattr(ln_q, "normalized_shape"):
        hidden_size = ln_q.normalized_shape[0]
    elif hasattr(ln_q, "weight"):
        hidden_size = ln_q.weight.shape[0]
    else:
        raise RuntimeError(f"Cannot determine hidden_size from ln_q. Type: {type(ln_q).__name__}")
elif hasattr(self, "out_hidden_size"):
    hidden_size = self.out_hidden_size
elif hasattr(self, "config") and hasattr(self.config, "hidden_size"):
    hidden_size = self.config.hidden_size
else:
    raise RuntimeError(
        f"Cannot determine hidden_size for VisionTransformer. "
        f"Model type: {type(self).__name__}. "
        f"Available attributes: {[a for a in dir(self) if not a.startswith('_')]}"
    )
```

high

The logic to determine hidden_size for ranks with no images is complex and relies on a series of hasattr checks for specific model attributes. This approach is brittle and may break with new or refactored models, leading to RuntimeError. A more robust approach would be to make this explicit. For example, create_dp_vision_forward could accept the hidden_size or a hidden_size_getter function as an argument. This would be provided at the patching site where the model context is clear, making the wrapper more reliable and easier to maintain.
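A sketch of the suggested refactor (signature and names are hypothetical, not the PR's actual API): the patch site, where the concrete model class is known, supplies a getter, so an empty rank can build a correctly shaped `(0, hidden_size)` tensor without attribute probing.

```python
def create_dp_vision_forward(original_forward, hidden_size_getter=None):
    """Wrap a ViT forward for Vision DP; `hidden_size_getter` is provided
    by the caller at patch time (hypothetical parameter)."""

    def dp_forward(self, pixel_values, grid_thw, **kwargs):
        if pixel_values.shape[0] == 0:  # this rank was assigned no images
            hidden = (hidden_size_getter(self) if hidden_size_getter is not None
                      else self.config.hidden_size)
            return pixel_values.new_zeros((0, hidden))
        return original_forward(self, pixel_values, grid_thw, **kwargs)

    return dp_forward
```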

When Ulysses sequence parallelism is enabled (sp_size > 1), the
VisionTransformer processes all images on every rank redundantly.
This adds Vision Data Parallel (DP) which distributes whole images
across SP ranks for independent ViT processing, then all-gathers
embeddings once at the end — reducing ViT peak memory by ~sp_size x.

Also removes the forced eager attention fallback for ViT when SP>1,
since Vision DP makes each rank process only its local images and
the _ulysses_flash_attention_forward already correctly skips ViT
(via position_ids is None guard).

Supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aoshen524 aoshen524 force-pushed the feat/vision-dp-parallel branch from 76ba823 to 8763b64 on February 7, 2026 13:56
aoshen524 added a commit to aoshen524/ROLL that referenced this pull request Feb 16, 2026
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation,
reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities,
  GatherVisionEmbeddings autograd function, and model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for Qwen2-VL, Qwen2.5-VL, Qwen3-VL,
  Qwen3-VL-MoE VisionTransformer classes
- Integrate into DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…issues

Address reviewer comments from PR verl-project#5230:

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with
   `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate
   partial sequence gradients before slicing. Fixes silent gradient loss
   when vision tokens span multiple sequence shard boundaries.

2. **Load-balanced assignment**: Replace count-based chunking with greedy
   contiguous bin-packing that balances total patch load across ranks.
   Reduces worst-case imbalance from 8.5x to <6x for variable-resolution.

3. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from
   caller instead of doing all_gather in forward. grid_thw is replicated
   on all ranks, so embedding counts can be computed locally.

4. **Idempotency guard**: Add `_vision_dp_patched` attribute check to
   prevent double-wrapping when `apply_monkey_patch` is called multiple times.

5. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to dp_vision_forward
   entry point to avoid repeated `.tolist()` GPU→CPU syncs in helpers.

6. **Tensor slicing**: Replace Python loop + list append in
   prepare_local_vision_inputs with contiguous tensor slice using cumsum.

7. **Test improvements**: Rename tests to `test_<what>_<condition>_<expected>`,
   add load balancing test, add gather_none_group test, use parametrize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
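Item 6 above (cumsum-based slicing) can be illustrated on plain lists (a sketch of the idea; the real code slices torch tensors, and the helper name here is made up):

```python
from itertools import accumulate


def slice_local_patches(patch_rows, patch_counts, start_img, end_img):
    """Take the contiguous run of flattened patch rows belonging to images
    [start_img, end_img): one cumsum + one slice, no per-image Python loop."""
    offsets = [0] + list(accumulate(patch_counts))
    local = patch_rows[offsets[start_img]:offsets[end_img]]
    # Independent sanity check against per-image counts (the PR replaced a
    # tautological slice-length assertion with this kind of check).
    assert len(local) == sum(patch_counts[start_img:end_img])
    return local
```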
aoshen524 added a commit to aoshen524/ROLL that referenced this pull request Feb 26, 2026
…es SP ranks

(Same commit message as the February 16 reference above.)
aoshen524 and others added 3 commits March 3, 2026 22:31
…ard bug

- Remove dead store ctx.hidden_size (unused in backward)
- Simplify hidden_size detection: 20-line hasattr chain → self.config.out_hidden_size
- Trim verbose docstrings to concise one-liners
- Add .contiguous() guard in backward for NCCL safety
- Fix P1: empty rank must call requires_grad_() to participate in backward all_reduce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect Qwen3-VL via model attribute (hasattr deepstack_merger_list)
instead of return type, so empty ranks that skip original_forward
still create matching empty deepstack tensors and participate in
all-gather — preventing NCCL deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `actor_rollout_ref.model.vision_dp` (default False) to control
Vision DP activation. Previously Vision DP was unconditionally enabled
when ulysses_sp_size > 1.

Config flow: hf_model.yaml → fsdp_workers._build_model_optimizer →
apply_monkey_patch(vision_dp=...) → gate ViT class patching.

Also gates Qwen3-VL/Qwen3-VL-MoE Vision DP patching behind the same
flag for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
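Based on the flag described above, enabling it might look like this in the trainer config (illustrative excerpt; the surrounding keys are assumptions, only `actor_rollout_ref.model.vision_dp` is confirmed by the commit message):

```yaml
actor_rollout_ref:
  model:
    vision_dp: true  # default false; only takes effect when ulysses_sequence_parallel_size > 1
```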
@aoshen524
Contributor Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Vision Data Parallelism (Vision DP) to distribute Vision Transformer (ViT) computation across sequence parallel ranks, which is a valuable feature for reducing memory consumption. The implementation includes core utilities for image assignment, a custom autograd function for correct gradient aggregation, and integration via monkey-patching. The changes are accompanied by comprehensive unit tests and a detailed precision analysis in the description. My review identified an opportunity to refactor duplicated code in the monkey-patching logic to improve maintainability.

Comment on lines +407 to +452
```python
# Step 4: patch VisionTransformer for Vision DP (image-level distribution)
if ulysses_sp_size > 1:
    if vision_dp:
        from verl.utils.vision_dp import create_dp_vision_forward

        # Patch Qwen2-VL VisionTransformer
        try:
            from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VisionTransformerPretrainedModel

            if not getattr(Qwen2VisionTransformerPretrainedModel, "_vision_dp_patched", False):
                original_vision_forward = Qwen2VisionTransformerPretrainedModel.forward
                Qwen2VisionTransformerPretrainedModel.forward = create_dp_vision_forward(
                    original_vision_forward
                )
                Qwen2VisionTransformerPretrainedModel._vision_dp_patched = True
                print(
                    f"Monkey patch Qwen2VisionTransformerPretrainedModel.forward"
                    f" for Vision DP (dp_size={ulysses_sp_size})"
                )
        except ImportError as e:
            print(f"Warning: Could not patch Qwen2VisionTransformer for Vision DP: {e}")

        # Patch Qwen2.5-VL VisionTransformer (uses a different class)
        try:
            from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
                Qwen2_5_VisionTransformerPretrainedModel,
            )

            if not getattr(Qwen2_5_VisionTransformerPretrainedModel, "_vision_dp_patched", False):
                original_vision_forward_25 = Qwen2_5_VisionTransformerPretrainedModel.forward
                Qwen2_5_VisionTransformerPretrainedModel.forward = create_dp_vision_forward(
                    original_vision_forward_25
                )
                Qwen2_5_VisionTransformerPretrainedModel._vision_dp_patched = True
                print(
                    f"Monkey patch Qwen2_5_VisionTransformerPretrainedModel.forward"
                    f" for Vision DP (dp_size={ulysses_sp_size})"
                )
        except ImportError as e:
            print(f"Warning: Could not patch Qwen2_5VisionTransformer for Vision DP: {e}")
    else:
        print(
            f"Vision DP disabled (vision_dp=False). "
            f"ViT runs replicated on all {ulysses_sp_size} SP ranks."
        )
```


high

There is significant code duplication in this block and in the qwen3_vl block below (lines 487-522) for applying the Vision DP monkey patch. This repetitive logic can be refactored into a helper function to improve maintainability and reduce redundancy.

A helper function could encapsulate the try...except block, the import, the idempotency check, and the patching logic itself. This would make the code cleaner and easier to extend to other vision models in the future.

For example, you could define a helper function like this:

```python
def _patch_vision_model_for_dp(model_class_name: str, module_path: str, ulysses_sp_size: int):
    """Applies the Vision DP monkey patch to a given model class."""
    try:
        from verl.utils.vision_dp import create_dp_vision_forward
        module = __import__(module_path, fromlist=[model_class_name])
        model_class = getattr(module, model_class_name)

        if not getattr(model_class, "_vision_dp_patched", False):
            original_forward = model_class.forward
            model_class.forward = create_dp_vision_forward(original_forward)
            model_class._vision_dp_patched = True
            print(
                f"Monkey patch {model_class_name}.forward for Vision DP (dp_size={ulysses_sp_size})"
            )
    except (ImportError, AttributeError) as e:
        print(f"Warning: Could not patch {model_class_name} for Vision DP: {e}")
```

And then call it for each model:

```python
# Step 4: patch VisionTransformer for Vision DP (image-level distribution)
if ulysses_sp_size > 1:
    if vision_dp:
        _patch_vision_model_for_dp(
            "Qwen2VisionTransformerPretrainedModel",
            "transformers.models.qwen2_vl.modeling_qwen2_vl",
            ulysses_sp_size,
        )
        _patch_vision_model_for_dp(
            "Qwen2_5_VisionTransformerPretrainedModel",
            "transformers.models.qwen2_5_vl.modeling_qwen2_5_vl",
            ulysses_sp_size,
        )
    else:
        print(
            f"Vision DP disabled (vision_dp=False). "
            f"ViT runs replicated on all {ulysses_sp_size} SP ranks."
        )
```

This refactoring would also apply to the qwen3_vl block.

aoshen524 and others added 5 commits March 4, 2026 10:50
- Add Vision DP monkey patch for Glm4vVisionModel (same interface as Qwen2-VL)
- Add vision_dp flag parameter to apply_monkey_patch for explicit gating
- Refine vision_dp.py: streamline hidden_size detection, simplify docstrings
- Add unit tests (test_vision_dp.py) and distributed tests (test_vision_dp_distributed.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-check

The assertion `tensor[a:b].shape[0] == b - a` always passes by Python
slicing semantics. Replace with an independent verification path:
compute expected patches from `get_image_patch_counts(local_grid_thw)`
instead of from the same `offsets` used for slicing.

Addresses review comment: https://github.com/inclusionAI/AReaL/pull/929/files#r2863271094

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename all tests to follow `test_<what>_<condition>_<expected>()` convention
  for self-documenting test output in CI failure logs
- Add TestCreateDpVisionForward: verify sp_size<=1 raises RuntimeError
- Add contiguous coverage test across multiple dp_sizes

Addresses review comments:
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845688349
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845665917

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix NCCL deadlock: add .requires_grad_() on empty deepstack tensors
- Fix contiguity: add .contiguous() guard in GatherVisionEmbeddings else branch
- Perf: pass grid_thw_cpu to prepare_local_vision_inputs, avoid GPU→CPU syncs
- Simplify: use getattr(self, "spatial_merge_size", 1) instead of multi-if
- Simplify: hidden_size fallback tries both out_hidden_size and hidden_size
- Extract: _unpack_deepstack() helper for cleaner deepstack handling
- Remove: non-CUDA RuntimeError check in backward (unnecessary guard)
- Test: add TestUnpackDeepstack (incl. requires_grad verification)
- Test: add embedding_counts_merge_1_equals_patch_counts consistency test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ing, fix assertion, add docs

- Pass pre-computed patch_counts to prepare_local_vision_inputs to avoid
  redundant grid_thw product computation
- Replace silent ImportError with logger.warning in monkey_patch.py so
  users know when Vision DP fails to apply
- Add try/except for Qwen3-VL and GLM-4V vision patches (previously would crash)
- Fix tautological assertion (end-start from slicing is always true) to
  verify against independently computed sum of per-image patch counts
- Add Vision DP documentation to perf_tuning.rst and config.rst

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion_inputs

Simplify the code by removing the Optional fallback path. Since all
callers already have patch_counts computed, there's no need for the
function to recompute them internally. Update all test call sites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aoshen524
Contributor Author

aoshen524 commented Mar 12, 2026

@ISEEKYAN Hi, could you please help review this PR? Thanks!
