feat(vision): add Vision DP for parallel ViT computation across SP ranks #5230
aoshen524 wants to merge 11 commits into verl-project:main
Conversation
Code Review
This pull request introduces Vision Data Parallelism (DP) to enable parallel ViT computation by distributing images across sequence parallel ranks. The implementation is well-structured, with new utilities in verl/utils/vision_dp.py and corresponding monkey-patching logic. The addition of comprehensive unit tests in tests/test_vision_dp.py is commendable.
My review focuses on improving code quality, robustness, and maintainability. I've identified a few areas with high-severity issues:
- Redundant computation in prepare_local_vision_inputs.
- Fragile logic for determining hidden_size in the DP wrapper, which could lead to runtime errors with different models.
- Significant code duplication in monkey_patch.py that should be refactored.
Addressing these points will make the new functionality more robust and easier to maintain in the future.
if ulysses_sp_size > 1:
    from verl.utils.vision_dp import create_dp_vision_forward

    # Patch Qwen2-VL VisionTransformer
    try:
        from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VisionTransformerPretrainedModel

        original_vision_forward = Qwen2VisionTransformerPretrainedModel.forward
        Qwen2VisionTransformerPretrainedModel.forward = create_dp_vision_forward(original_vision_forward)
        print(
            f"Monkey patch Qwen2VisionTransformerPretrainedModel.forward"
            f" for Vision DP (dp_size={ulysses_sp_size})"
        )
    except ImportError as e:
        print(f"Warning: Could not patch Qwen2VisionTransformer for Vision DP: {e}")

    # Patch Qwen2.5-VL VisionTransformer (uses a different class)
    try:
        from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
            Qwen2_5_VisionTransformerPretrainedModel,
        )

        original_vision_forward_25 = Qwen2_5_VisionTransformerPretrainedModel.forward
        Qwen2_5_VisionTransformerPretrainedModel.forward = create_dp_vision_forward(original_vision_forward_25)
        print(
            f"Monkey patch Qwen2_5_VisionTransformerPretrainedModel.forward"
            f" for Vision DP (dp_size={ulysses_sp_size})"
        )
    except ImportError as e:
        print(f"Warning: Could not patch Qwen2_5VisionTransformer for Vision DP: {e}")
This block of code for monkey-patching the vision transformer is repeated multiple times in this file (here for Qwen2/2.5-VL, and again for Qwen3 models). This duplication makes the code harder to read and maintain. Please consider refactoring this logic into a helper function that takes the module path and class name as arguments. This would significantly reduce code duplication and improve maintainability.
For example, a helper could look like this:
def _patch_vision_model_for_dp(module_path, class_name, ulysses_sp_size):
    try:
        module = __import__(module_path, fromlist=[class_name])
        model_class = getattr(module, class_name)
        original_forward = model_class.forward
        model_class.forward = create_dp_vision_forward(original_forward)
        print(f"Monkey patch {class_name}.forward for Vision DP (dp_size={ulysses_sp_size})")
    except (ImportError, AttributeError) as e:
        print(f"Warning: Could not patch {class_name} for Vision DP: {e}")

Additionally, the step numbering is inconsistent ("Step 4" here, but "Step 3" for the Qwen3 models). A refactor would help resolve such inconsistencies.
verl/utils/vision_dp.py
# Compute patch offsets for each image
patch_counts = (grid_thw[:, 0] * grid_thw[:, 1] * grid_thw[:, 2]).tolist()
The patch_counts are recalculated here, but they have already been computed in the calling function create_dp_vision_forward. This leads to redundant computation. You can improve efficiency and code clarity by passing patch_counts as an argument to prepare_local_vision_inputs and removing this line.
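As a sketch of the suggested change (names and shapes are illustrative, not the actual verl signatures): the caller passes the already-computed patch_counts, and the helper only slices.

```python
import torch

def prepare_local_vision_inputs(pixel_values, grid_thw, assigned_ids, patch_counts):
    """Slice out the patches belonging to this rank's assigned images.

    `patch_counts` is supplied by the caller (it was already computed in
    create_dp_vision_forward) instead of being recomputed from grid_thw.
    """
    # Exclusive prefix sum: start offset of each image's patch block.
    offsets = [0]
    for count in patch_counts:
        offsets.append(offsets[-1] + count)
    local_patches = torch.cat(
        [pixel_values[offsets[i]:offsets[i + 1]] for i in assigned_ids], dim=0
    )
    local_grid_thw = grid_thw[assigned_ids]
    return local_patches, local_grid_thw
```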
# This rank has no images, create empty tensor with correct hidden size
# Try multiple common attribute paths for hidden size detection
if hasattr(self, "merger") and hasattr(self.merger, "ln_q"):
    ln_q = self.merger.ln_q
    if hasattr(ln_q, "normalized_shape"):
        hidden_size = ln_q.normalized_shape[0]
    elif hasattr(ln_q, "weight"):
        hidden_size = ln_q.weight.shape[0]
    else:
        raise RuntimeError(f"Cannot determine hidden_size from ln_q. Type: {type(ln_q).__name__}")
elif hasattr(self, "out_hidden_size"):
    hidden_size = self.out_hidden_size
elif hasattr(self, "config") and hasattr(self.config, "hidden_size"):
    hidden_size = self.config.hidden_size
else:
    raise RuntimeError(
        f"Cannot determine hidden_size for VisionTransformer. "
        f"Model type: {type(self).__name__}. "
        f"Available attributes: {[a for a in dir(self) if not a.startswith('_')]}"
    )
The logic to determine hidden_size for ranks with no images is complex and relies on a series of hasattr checks for specific model attributes. This approach is brittle and may break with new or refactored models, leading to RuntimeError. A more robust approach would be to make this explicit. For example, create_dp_vision_forward could accept the hidden_size or a hidden_size_getter function as an argument. This would be provided at the patching site where the model context is clear, making the wrapper more reliable and easier to maintain.
When Ulysses sequence parallelism is enabled (sp_size > 1), the VisionTransformer processes all images on every rank redundantly. This adds Vision Data Parallel (DP), which distributes whole images across SP ranks for independent ViT processing, then all-gathers embeddings once at the end, reducing ViT peak memory by ~sp_size x.

Also removes the forced eager-attention fallback for ViT when SP > 1, since Vision DP makes each rank process only its local images and _ulysses_flash_attention_forward already correctly skips the ViT (via the position_ids is None guard).

Supports Qwen2-VL, Qwen2.5-VL, Qwen3-VL, and Qwen3-VL-MoE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…es SP ranks

Distribute whole images across Ulysses SP ranks for parallelized ViT computation, reducing ViT peak memory by ~sp_size x (e.g. SP=4 -> ~4x ViT memory reduction).

Key changes:
- Add roll/utils/context_parallel/vision_dp.py with image distribution utilities, GatherVisionEmbeddings autograd function, and a model-agnostic VisionTransformer wrapper
- Add apply_vision_dp_patch() in monkey_patch.py for the Qwen2-VL, Qwen2.5-VL, Qwen3-VL, Qwen3-VL-MoE VisionTransformer classes
- Integrate into the DeepSpeed strategy (both inference and training workers)
- Add 17 unit tests covering all utility functions, edge cases, and integration workflows

Ported from verl (verl-project/verl#5230).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…issues

Address reviewer comments from PR verl-project#5230:

1. **Gradient routing fix (critical)**: Replace `grad_scaler * dp_size` with `all_reduce(SUM)` in GatherVisionEmbeddings.backward() to aggregate partial sequence gradients before slicing. Fixes silent gradient loss when vision tokens span multiple sequence shard boundaries.
2. **Load-balanced assignment**: Replace count-based chunking with greedy contiguous bin-packing that balances total patch load across ranks. Reduces worst-case imbalance from 8.5x to <6x for variable-resolution inputs.
3. **Remove unnecessary all_gather**: Pass pre-computed `all_counts` from the caller instead of doing all_gather in forward. grid_thw is replicated on all ranks, so embedding counts can be computed locally.
4. **Idempotency guard**: Add a `_vision_dp_patched` attribute check to prevent double-wrapping when `apply_monkey_patch` is called multiple times.
5. **GPU→CPU sync optimization**: Move `grid_thw.cpu()` to the dp_vision_forward entry point to avoid repeated `.tolist()` GPU→CPU syncs in helpers.
6. **Tensor slicing**: Replace the Python loop + list append in prepare_local_vision_inputs with a contiguous tensor slice using cumsum.
7. **Test improvements**: Rename tests to `test_<what>_<condition>_<expected>`, add a load balancing test, add a gather_none_group test, use parametrize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
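The greedy contiguous bin-packing mentioned in item 2 above can be sketched as follows; this is an illustrative heuristic under stated assumptions, and the actual verl implementation may differ in its tie-breaking.

```python
def assign_images_contiguous(patch_counts, dp_size):
    """Partition images into dp_size contiguous chunks, greedily
    balancing the total patch load per rank."""
    n = len(patch_counts)
    target = sum(patch_counts) / dp_size
    assignments = [[] for _ in range(dp_size)]
    rank, load = 0, 0
    for i, count in enumerate(patch_counts):
        remaining_imgs = n - i
        remaining_ranks = dp_size - rank
        # Close the current chunk once it is near the target load, or when
        # continuing would leave a later rank with no image at all.
        full_enough = load + count / 2 > target
        must_close = remaining_imgs < remaining_ranks
        if rank < dp_size - 1 and load > 0 and (full_enough or must_close):
            rank += 1
            load = 0
        assignments[rank].append(i)
        load += count
    return assignments
```

Keeping assignments contiguous preserves the original image order, so the gathered embeddings can be concatenated back without a permutation step.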
…ard bug

- Remove dead store ctx.hidden_size (unused in backward)
- Simplify hidden_size detection: 20-line hasattr chain → self.config.out_hidden_size
- Trim verbose docstrings to concise one-liners
- Add .contiguous() guard in backward for NCCL safety
- Fix P1: empty rank must call requires_grad_() to participate in the backward all_reduce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect Qwen3-VL via a model attribute (hasattr deepstack_merger_list) instead of the return type, so empty ranks that skip original_forward still create matching empty deepstack tensors and participate in the all-gather, preventing NCCL deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `actor_rollout_ref.model.vision_dp` (default False) to control Vision DP activation. Previously Vision DP was unconditionally enabled when ulysses_sp_size > 1.

Config flow: hf_model.yaml → fsdp_workers._build_model_optimizer → apply_monkey_patch(vision_dp=...) → gate ViT class patching.

Also gates Qwen3-VL/Qwen3-VL-MoE Vision DP patching behind the same flag for consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces Vision Data Parallelism (Vision DP) to distribute Vision Transformer (ViT) computation across sequence parallel ranks, which is a valuable feature for reducing memory consumption. The implementation includes core utilities for image assignment, a custom autograd function for correct gradient aggregation, and integration via monkey-patching. The changes are accompanied by comprehensive unit tests and a detailed precision analysis in the description. My review identified an opportunity to refactor duplicated code in the monkey-patching logic to improve maintainability.
# Step 4: patch VisionTransformer for Vision DP (image-level distribution)
if ulysses_sp_size > 1:
    if vision_dp:
        from verl.utils.vision_dp import create_dp_vision_forward

        # Patch Qwen2-VL VisionTransformer
        try:
            from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VisionTransformerPretrainedModel

            if not getattr(Qwen2VisionTransformerPretrainedModel, "_vision_dp_patched", False):
                original_vision_forward = Qwen2VisionTransformerPretrainedModel.forward
                Qwen2VisionTransformerPretrainedModel.forward = create_dp_vision_forward(
                    original_vision_forward
                )
                Qwen2VisionTransformerPretrainedModel._vision_dp_patched = True
                print(
                    f"Monkey patch Qwen2VisionTransformerPretrainedModel.forward"
                    f" for Vision DP (dp_size={ulysses_sp_size})"
                )
        except ImportError as e:
            print(f"Warning: Could not patch Qwen2VisionTransformer for Vision DP: {e}")

        # Patch Qwen2.5-VL VisionTransformer (uses a different class)
        try:
            from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import (
                Qwen2_5_VisionTransformerPretrainedModel,
            )

            if not getattr(Qwen2_5_VisionTransformerPretrainedModel, "_vision_dp_patched", False):
                original_vision_forward_25 = Qwen2_5_VisionTransformerPretrainedModel.forward
                Qwen2_5_VisionTransformerPretrainedModel.forward = create_dp_vision_forward(
                    original_vision_forward_25
                )
                Qwen2_5_VisionTransformerPretrainedModel._vision_dp_patched = True
                print(
                    f"Monkey patch Qwen2_5_VisionTransformerPretrainedModel.forward"
                    f" for Vision DP (dp_size={ulysses_sp_size})"
                )
        except ImportError as e:
            print(f"Warning: Could not patch Qwen2_5VisionTransformer for Vision DP: {e}")
    else:
        print(
            f"Vision DP disabled (vision_dp=False). "
            f"ViT runs replicated on all {ulysses_sp_size} SP ranks."
        )
There is significant code duplication in this block and in the qwen3_vl block below (lines 487-522) for applying the Vision DP monkey patch. This repetitive logic can be refactored into a helper function to improve maintainability and reduce redundancy.
A helper function could encapsulate the try...except block, the import, the idempotency check, and the patching logic itself. This would make the code cleaner and easier to extend to other vision models in the future.
For example, you could define a helper function like this:
def _patch_vision_model_for_dp(model_class_name: str, module_path: str, ulysses_sp_size: int):
    """Applies the Vision DP monkey patch to a given model class."""
    try:
        from verl.utils.vision_dp import create_dp_vision_forward

        module = __import__(module_path, fromlist=[model_class_name])
        model_class = getattr(module, model_class_name)
        if not getattr(model_class, "_vision_dp_patched", False):
            original_forward = model_class.forward
            model_class.forward = create_dp_vision_forward(original_forward)
            model_class._vision_dp_patched = True
            print(
                f"Monkey patch {model_class_name}.forward for Vision DP (dp_size={ulysses_sp_size})"
            )
    except (ImportError, AttributeError) as e:
        print(f"Warning: Could not patch {model_class_name} for Vision DP: {e}")

And then call it for each model:
# Step 4: patch VisionTransformer for Vision DP (image-level distribution)
if ulysses_sp_size > 1:
    if vision_dp:
        _patch_vision_model_for_dp(
            "Qwen2VisionTransformerPretrainedModel",
            "transformers.models.qwen2_vl.modeling_qwen2_vl",
            ulysses_sp_size,
        )
        _patch_vision_model_for_dp(
            "Qwen2_5_VisionTransformerPretrainedModel",
            "transformers.models.qwen2_5_vl.modeling_qwen2_5_vl",
            ulysses_sp_size,
        )
    else:
        print(
            f"Vision DP disabled (vision_dp=False). "
            f"ViT runs replicated on all {ulysses_sp_size} SP ranks."
        )

This refactoring would also apply to the qwen3_vl block.
- Add Vision DP monkey patch for Glm4vVisionModel (same interface as Qwen2-VL)
- Add vision_dp flag parameter to apply_monkey_patch for explicit gating
- Refine vision_dp.py: streamline hidden_size detection, simplify docstrings
- Add unit tests (test_vision_dp.py) and distributed tests (test_vision_dp_distributed.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-check

The assertion `tensor[a:b].shape[0] == b - a` always passes by Python slicing semantics. Replace it with an independent verification path: compute expected patches from `get_image_patch_counts(local_grid_thw)` instead of from the same `offsets` used for slicing.

Addresses review comment: https://github.com/inclusionAI/AReaL/pull/929/files#r2863271094

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename all tests to follow the `test_<what>_<condition>_<expected>()` convention for self-documenting test output in CI failure logs
- Add TestCreateDpVisionForward: verify sp_size <= 1 raises RuntimeError
- Add contiguous coverage test across multiple dp_sizes

Addresses review comments:
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845688349
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845665917

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix NCCL deadlock: add .requires_grad_() on empty deepstack tensors
- Fix contiguity: add .contiguous() guard in GatherVisionEmbeddings else branch
- Perf: pass grid_thw_cpu to prepare_local_vision_inputs, avoiding GPU→CPU syncs
- Simplify: use getattr(self, "spatial_merge_size", 1) instead of multi-if
- Simplify: hidden_size fallback tries both out_hidden_size and hidden_size
- Extract: _unpack_deepstack() helper for cleaner deepstack handling
- Remove: non-CUDA RuntimeError check in backward (unnecessary guard)
- Test: add TestUnpackDeepstack (incl. requires_grad verification)
- Test: add embedding_counts_merge_1_equals_patch_counts consistency test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ing, fix assertion, add docs

- Pass pre-computed patch_counts to prepare_local_vision_inputs to avoid redundant grid_thw product computation
- Replace silent ImportError with logger.warning in monkey_patch.py so users know when Vision DP fails to apply
- Add try/except for the Qwen3-VL and GLM-4V vision patches (previously these would crash)
- Fix tautological assertion (end - start from slicing is always true) to verify against an independently computed sum of per-image patch counts
- Add Vision DP documentation to perf_tuning.rst and config.rst

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion_inputs

Simplify the code by removing the Optional fallback path. Since all callers already have patch_counts computed, there is no need for the function to recompute them internally. Update all test call sites.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ISEEKYAN Hi, could you please help review this PR? Thanks!
Vision Data Parallel: Distribute ViT computation across Ulysses SP ranks
Motivation
When using Ulysses Sequence Parallelism (sp_size > 1), the attention layers split across SP ranks via all-to-all, but the VisionTransformer (ViT) still processes ALL images on every rank, wasting memory proportional to total_images.

Vision DP fixes this by distributing whole images (not patches) across SP ranks:
Design
- all_reduce(SUM) in backward aggregates partial sequence gradients from all SP ranks before slicing by image assignment

Key changes

- verl/utils/vision_dp.py
- verl/models/transformers/monkey_patch.py
- tests/utils/test_vision_dp_on_cpu.py

Tests
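The gather-with-summed-backward design described above can be sketched as a torch.autograd.Function. This is a hedged illustration, not the verl source: the padding for the uneven all_gather and the single-rank fallback (group=None) are implementation assumptions.

```python
import torch
import torch.distributed as dist

class GatherVisionEmbeddings(torch.autograd.Function):
    """All-gather per-rank vision embeddings in forward; in backward,
    all_reduce(SUM) the incoming gradient before slicing this rank's rows."""

    @staticmethod
    def forward(ctx, local_embeds, counts, group):
        ctx.counts, ctx.group = counts, group
        if group is None or not (dist.is_available() and dist.is_initialized()):
            ctx.rank = 0
            return local_embeds  # single-rank fallback: identity
        rank = dist.get_rank(group)
        ctx.rank = rank
        hidden = local_embeds.shape[-1]
        # Standard all_gather needs equal shapes, so pad to the max count.
        max_count = max(counts)
        padded = local_embeds.new_zeros(max_count, hidden)
        padded[: counts[rank]] = local_embeds
        buffers = [local_embeds.new_empty(max_count, hidden) for _ in counts]
        dist.all_gather(buffers, padded.contiguous(), group=group)
        return torch.cat([b[:c] for b, c in zip(buffers, counts)], dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = grad_output.contiguous()
        if ctx.group is None or not (dist.is_available() and dist.is_initialized()):
            return grad_output, None, None
        # Sum partial gradients from all SP ranks before slicing by assignment;
        # this is what fixes gradient loss across sequence shard boundaries.
        dist.all_reduce(grad_output, op=dist.ReduceOp.SUM, group=ctx.group)
        start = sum(ctx.counts[: ctx.rank])
        return grad_output[start : start + ctx.counts[ctx.rank]], None, None
```

The group=None path also matches the gather_none_group test mentioned in the commit log: with no process group the op degenerates to an identity with a pass-through gradient.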
python -m pytest tests/utils/test_vision_dp_on_cpu.py -v  # 21 passed

Precision Alignment: Vision DP On vs Off
Experiment Setup
Comparison Method
- Dump gradient tensors (.pt files) at the optimizer step via the VERL_PRECISION_DUMP_GRADS env var. Each run auto-creates a UUID subdirectory.
- Dump at three phases: pre_clip (raw grad), before (post-clip), after (post-step), along with L2 norm (fp64) and absmax scalars.
- tools/compare_grads.py loads the dumped tensors and computes per-parameter max(|A-B|), mean(|A-B|), and cosine similarity, aggregated by scope (vision/language/other).
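As a minimal sketch of the per-parameter metrics described above (assuming tools/compare_grads.py computes them roughly like this, in fp64 as the L2-norm note suggests):

```python
import torch

def grad_diff_stats(a: torch.Tensor, b: torch.Tensor):
    """Return (max |A-B|, mean |A-B|, cosine similarity) for one parameter,
    computed in fp64 so the metrics themselves add no rounding error."""
    a64, b64 = a.double().flatten(), b.double().flatten()
    diff = (a64 - b64).abs()
    cosine = torch.nn.functional.cosine_similarity(a64, b64, dim=0)
    return diff.max().item(), diff.mean().item(), cosine.item()
```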
Results: Per-Parameter Element-wise Diff
Top-5 worst parameters (all vision scope):
- model.visual.patch_embed.proj.weight
- model.visual.blocks.7.attn.qkv.weight
- model.visual.blocks.1.mlp.gate_proj.weight
- model.visual.blocks.0.mlp.gate_proj.weight
- model.visual.blocks.1.norm1.weight