feat(models): shard vision encoder across Ulysses SP ranks #929
rchardx merged 14 commits into inclusionAI:main
Conversation
…ss Ulysses SP ranks

When using Ulysses SP (sp_size > 1), the ViT processes ALL images on every SP rank redundantly. Vision DP distributes whole images across SP ranks and all-gathers the embeddings, reducing ViT memory from O(total_images) to O(total_images/N).

Key design:
- Image-level distribution (not patch-level) to preserve cu_seqlens semantics
- Contiguous assignment: rank k gets images [start_k, ..., end_k]; no reorder needed after all-gather
- GatherVisionEmbeddings autograd Function with gradient scaling to compensate for partial processing before FSDP reduction
- No explicit gradient sync needed: AReaL's FSDP uses a dp_sp submesh that covers both DP and SP dimensions

Adapted from verl PR #5230.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
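The contiguous image assignment described above can be sketched as a small pure function (the name and signature are illustrative, not the PR's actual code). This is the initial count-based chunking; because each rank's range is contiguous and in rank order, concatenating the all-gathered embeddings already yields the original image order.

```python
def assign_images_contiguously(num_images, sp_size):
    """Split image indices [0, num_images) into sp_size contiguous
    [start, end) ranges, distributing the remainder to the lowest ranks.
    """
    base, rem = divmod(num_images, sp_size)
    ranges, start = [], 0
    for rank in range(sp_size):
        count = base + (1 if rank < rem else 0)
        ranges.append((start, start + count))
        start += count
    return ranges
```

Ranks beyond the image count simply receive an empty range, which is why the empty-rank paths discussed later in the review matter.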
Summary of Changes

Hello @aoshen524, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces Vision Data Parallelism (Vision DP) to significantly reduce memory consumption for Vision Transformers (ViT) within Visual Language Models (VLMs) when operating with Ulysses Sequence Parallelism. By distributing images across SP ranks rather than processing all images on every rank, it scales down ViT memory requirements. The integration is seamless for supported VLMs and leverages existing FSDP mechanisms for gradient synchronization.
Code Review
This pull request introduces Vision Data Parallelism (Vision DP) to distribute Vision Transformer (ViT) computation across sequence parallelism ranks, which is a great feature for reducing memory usage in Vision-Language Models. The implementation is well-structured, with the core logic encapsulated in areal/utils/vision_dp.py, integration via monkey-patching, and comprehensive unit tests. The code is clean and well-documented. I have one suggestion to improve the performance and readability of the patch extraction logic in prepare_local_vision_inputs by leveraging tensorized operations.
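The tensorized extraction suggested above can be sketched as follows (function and parameter names are illustrative). Because images are assigned to a rank as one contiguous range, their patch rows also form one contiguous range, so a cumsum of per-image patch counts plus a single slice replaces the per-image Python loop:

```python
import torch

def extract_local_patches(pixel_values, patch_counts, start_img, end_img):
    """Slice this rank's patch rows in one step.

    pixel_values: (total_patches, patch_dim) tensor for ALL images.
    patch_counts: per-image patch counts, in image order.
    [start_img, end_img): this rank's contiguous image range.
    """
    counts = torch.as_tensor(patch_counts, dtype=torch.long)
    # offsets[i] = first patch row of image i; offsets[-1] = total patches
    offsets = torch.cat([counts.new_zeros(1), counts.cumsum(0)])
    return pixel_values[offsets[start_img] : offsets[end_img]]
```

The returned slice is a view of the full tensor, so no copy is made until a `.contiguous()` is needed for a collective.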
rchardx
left a comment
Thanks for the clean implementation! The design is well-documented and the contiguous assignment approach is elegant. I have a few questions and concerns I'd like to discuss — particularly around gradient flow in the backward pass. Please see inline comments.
1. Fix gradient routing bug: add all_reduce(SUM) in backward before slicing to aggregate gradient contributions from all SP ranks. Remove the incorrect `* dp_size` scaling and `grad_scaler` param.
2. Load balancing: replace count-based chunking with greedy contiguous bin-packing by patch load (dynamic target per rank).
3. Remove unnecessary all_gather: compute per-rank embedding counts locally, since grid_thw is replicated on all ranks.
4. Add idempotency guard: extract _patch_vision_class() helper with a _vision_dp_patched attribute to prevent double-wrapping.
5. Remove unreachable qwen3_vl_moe code block.
6. Move grid_thw to CPU at dp_vision_forward entry to avoid repeated GPU→CPU syncs in metadata helpers.
7. Replace Python loops with tensor slicing in prepare_local_vision_inputs (cumsum + contiguous slice).
8. Improve tests: rename to the test_<what>_<condition>_<expected> convention, add a load-balance test, add a gather_vision_embeddings passthrough test, update assignments to use contiguous indices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
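Fix 1 can be sketched as a custom autograd Function. `GatherVisionEmbeddings` matches the class name used in the PR, but the signature and internals here are assumptions. For brevity, the distributed path assumes equal per-rank counts (the PR uses a padded all-gather for uneven counts); the single-process fallback keeps the sketch runnable without a process group.

```python
import torch
import torch.distributed as dist

class GatherVisionEmbeddings(torch.autograd.Function):
    """All-gather per-rank ViT embeddings; route gradients back in backward.

    Backward all-reduces the full gradient across the SP group, then each
    rank keeps only its own slice. SUM already aggregates every rank's
    contribution, so no extra `* dp_size` scaling is applied.
    """

    @staticmethod
    def forward(ctx, local_emb, counts, rank, group=None):
        ctx.group = group
        ctx.start = sum(counts[:rank])
        ctx.end = ctx.start + counts[rank]
        if group is None or not dist.is_initialized():
            return local_emb  # single-process fallback for illustration
        chunks = [local_emb.new_empty(c, *local_emb.shape[1:]) for c in counts]
        dist.all_gather(chunks, local_emb.contiguous(), group=group)
        return torch.cat(chunks, dim=0)

    @staticmethod
    def backward(ctx, grad_out):
        grad = grad_out.contiguous()  # NCCL requires contiguous buffers
        if ctx.group is not None and dist.is_initialized():
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ctx.group)
        # One gradient per forward input: tensor grad, then None for
        # the non-tensor args (counts, rank, group).
        return grad[ctx.start : ctx.end], None, None, None
```

On a single process the collectives are skipped, so the Function degenerates to an identity with a slice in backward, which is what makes it unit-testable on CPU.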
…d contiguous guard

- Trim verbose docstrings to concise one-liners
- Delete dead store ctx.hidden_size (written in forward, never read in backward)
- Simplify hidden_size detection: self.config.out_hidden_size
- Add requires_grad_() for the empty rank to participate in the backward all_reduce
- Add .contiguous() guard before all_reduce (NCCL requirement)
- Reuse get_image_patch_counts in the spatial_merge_size == 1 path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Detect Qwen3-VL via model attribute (hasattr deepstack_merger_list) instead of return type, so empty ranks that skip original_forward still create matching empty deepstack tensors and participate in all-gather — preventing NCCL deadlock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
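A minimal sketch of the structural detection described above (the helper name is hypothetical; `deepstack_merger_list` is the attribute the commit names):

```python
def is_qwen3_vl(vision_model):
    """Detect Qwen3-VL by a structural attribute, not forward's return type.

    Ranks assigned zero images never call the original forward, so a
    return-type check cannot run there. An attribute check can, letting
    empty ranks build matching empty deepstack tensors and join the
    all-gather instead of deadlocking NCCL.
    """
    return hasattr(vision_model, "deepstack_merger_list")
```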
Add `vision_dp` field to TrainEngineConfig (default False). Vision DP is now only applied when explicitly enabled, instead of unconditionally when context_parallel_size > 1. Config flow: TrainEngineConfig.vision_dp → FSDPEngine → apply_monkey_patch → apply_vision_dp_patch(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-check

The assertion `tensor[a:b].shape[0] == b - a` always passes by Python slicing semantics. Replace it with an independent verification path: compute the expected patch count from `get_image_patch_counts(local_grid_thw)` instead of from the same `offsets` used for slicing.

Addresses review comment: https://github.com/inclusionAI/AReaL/pull/929/files#r2863271094

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
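The independent cross-check can be sketched like this (helper names follow the commit message; the exact formula for patches per image is an assumption, shown as t·h·w over the grid factors):

```python
def get_image_patch_counts(grid_thw):
    """Patches per image for a sequence of (t, h, w) grid factors."""
    return [int(t * h * w) for t, h, w in grid_thw]

def check_local_patch_count(local_pixel_values, local_grid_thw):
    """Verify the sliced patch tensor against an independent source.

    Deriving the expected row count from grid_thw (not from the same
    offsets used for slicing) means a slicing bug is actually caught,
    unlike `tensor[a:b].shape[0] == b - a`, which is true by construction.
    """
    expected = sum(get_image_patch_counts(local_grid_thw))
    assert local_pixel_values.shape[0] == expected, (
        f"expected {expected} local patches, got {local_pixel_values.shape[0]}"
    )
```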
- Rename tests to follow the `test_<what>_<condition>_<expected>()` convention for self-documenting test output in CI logs
- Add TestCreateDpVisionForward: verify sp_size <= 1 calls original_forward
- Add TestPatchVisionClass: verify forward replaced + idempotency guard
- Add TestApplyVisionDpPatch: verify ImportError does not crash
- Add gather_vision_embeddings same-storage passthrough test
- Add embedding_counts empty-input test
- Add contiguous-coverage test across multiple dp_sizes

Addresses review comments:
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845688349
- https://github.com/inclusionAI/AReaL/pull/929/files#r2845665917

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   tests/test_vision_dp.py
The "vision_dp" name was misleading — the feature shards vision encoding across SP (sequence parallel) ranks, not DP ranks. Rename globally for clarity, and fix several bugs found in review.

Key changes:
- Rename vision_dp -> shard_vision_across_sp across config, code, docs, tests
- Fix NCCL deadlock: add .requires_grad_() on empty deepstack tensors
- Fix perf: pass grid_thw_cpu instead of the GPU tensor to avoid 4x cudaStreamSync
- Fix latent issue: add .contiguous() guard before all_gather on the no-padding path
- Add warnings for silent no-ops (sp_size <= 1, non-VLM, zero classes patched)

Refs: inclusionAI#929
Move vision_sp_shard.py from areal/utils/ to areal/models/fsdp/ to co-locate with ulysses.py. Also move the shard_vision_across_sp config field from TrainEngineConfig to FSDPEngineConfig where it belongs.

Key changes:
- Move areal/utils/vision_sp_shard.py -> areal/models/fsdp/vision_sp_shard.py
- Move shard_vision_across_sp field to FSDPEngineConfig, accessed via self.config.fsdp
- Rename internal helpers with a _ prefix (not part of the public API)
- Add defensive .cpu() guard in _get_image_patch_counts/_get_image_embedding_counts
- Polish module docstring
- Update all imports, mock paths, and docs tables

Refs: inclusionAI#929
Extract the _unpack_deepstack helper, simplify the spatial_merge_size lookup, use a table-driven _VISION_CLASSES registry, and remove redundant .cpu() guards and a tautological assertion. Trim private docstrings.

Key changes:
- Extract _unpack_deepstack() to isolate Qwen3-VL deepstack logic
- Simplify spatial_merge_size to getattr(self, "spatial_merge_size", 1)
- Fix hidden_size fallback: out_hidden_size or hidden_size
- Remove redundant .cpu() in _get_image_patch_counts/_get_image_embedding_counts
- Remove the tautological cross-check assertion in _prepare_local_vision_inputs
- Refactor apply_vision_sp_shard_patch to a table-driven _VISION_CLASSES registry
- Trim multi-line docstrings on private functions to concise one-liners

Refs: inclusionAI#929
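A table-driven registry plus idempotency guard might look like the sketch below. The registry name and guard attribute follow the commit messages; the module paths and class names correspond to the Qwen2-VL family in transformers but should be treated as assumptions, not the PR's exact table.

```python
import importlib

_VISION_CLASSES = (
    ("transformers.models.qwen2_vl.modeling_qwen2_vl",
     "Qwen2VisionTransformerPretrainedModel"),
    ("transformers.models.qwen2_5_vl.modeling_qwen2_5_vl",
     "Qwen2_5_VisionTransformerPretrainedModel"),
)

def apply_vision_sp_shard_patch(wrap_forward, registry=_VISION_CLASSES):
    """Wrap .forward on every importable class in the registry, exactly once."""
    patched = 0
    for module_name, class_name in registry:
        try:
            cls = getattr(importlib.import_module(module_name), class_name)
        except (ImportError, AttributeError):
            continue  # model family not installed: skip instead of crashing
        if getattr(cls, "_vision_sp_patched", False):
            continue  # idempotency guard: never double-wrap forward
        cls.forward = wrap_forward(cls.forward)
        cls._vision_sp_patched = True
        patched += 1
    return patched  # caller can warn on 0 (silent no-op)
```

Adding a model family then becomes a one-line registry entry, and calling the patch twice (e.g. from two engine instances) is harmless.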
…ests

vision_sp_shard patches transformers model classes, so it belongs in models/transformers/, not models/fsdp/. Also improve test quality.

Key changes:
- Move vision_sp_shard.py from models/fsdp/ to models/transformers/
- Update import paths in ulyssess_patch.py and the test file
- Add _unpack_deepstack tests (3 code paths incl. the NCCL deadlock fix)
- Extract _assert_all_images_assigned helper to DRY 4 call sites
- Add merge_size=1 equivalence test for the refactored code path
- Merge duplicate gather test, simplify apply_patch test
- Remove "Adapted from verl" attribution from the test docstring

Refs: inclusionAI#929
When using Ulysses Sequence Parallelism with VLMs, every SP rank redundantly runs the full vision encoder. This adds a `shard_vision_across_sp` option that distributes whole images across SP ranks, runs the ViT locally, and all-gathers the embeddings, eliminating redundant computation while preserving gradient correctness via all_reduce(SUM) in backward.

Key changes:
- Add vision_sp_shard.py with greedy contiguous image assignment, padded all-gather, and a custom autograd backward
- Support Qwen2-VL, Qwen2.5-VL, and Qwen3-VL (incl. deepstack)
- Add shard_vision_across_sp flag to FSDPEngineConfig (effective only when context_parallel_size > 1)
- Monkey-patch VisionTransformer.forward via a table-driven registry
- Add 31 CPU-only unit tests covering all helper functions, deepstack unpacking, patching idempotency, and integration

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Wentai Zhang <zhangwentai.zwt@antgroup.com>
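The greedy contiguous assignment by patch load (review fix 2) can be sketched as follows — an illustrative implementation, not the PR's code. Each rank takes images until its patch load reaches a dynamic target of remaining_load / remaining_ranks, which keeps ranges contiguous (so gathered embeddings stay in image order) while balancing by patch count rather than image count:

```python
def assign_images_by_load(patch_counts, sp_size):
    """Contiguous greedy bin-packing of images onto sp_size ranks.

    patch_counts[i] is the ViT load of image i. Returns one [start, end)
    image range per rank; the last rank takes whatever remains, so every
    image is assigned exactly once.
    """
    ranges, start = [], 0
    n, remaining = len(patch_counts), sum(patch_counts)
    for rank in range(sp_size):
        ranks_left = sp_size - rank
        if ranks_left == 1:
            end = n  # last rank absorbs the tail
        else:
            target = remaining / ranks_left  # dynamic per-rank target
            load, end = 0, start
            while end < n and load < target:
                load += patch_counts[end]
                end += 1
        ranges.append((start, end))
        remaining -= sum(patch_counts[start:end])
        start = end
    return ranges
```

With counts [10, 1, 1, 10] and 2 ranks, count-based chunking would give loads 11 vs 11 only by luck of ordering; the load-based greedy split targets remaining/ranks_left at each step, so skewed image sizes no longer leave one rank doing most of the ViT work.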
Summary
- ViT memory goes from O(total_images) to O(total_images/N) when sp_size > 1.
- New vision_dp: bool in TrainEngineConfig — opt-in, default behavior unchanged.

Design
- GatherVisionEmbeddings autograd Function; embedding counts computed locally (no all_gather)
- all_reduce(SUM) across the SP group in backward before slicing — recovers the complete gradient for each image from all sequence shards
- Applied through apply_monkey_patch() in ulyssess_patch.py

Files Changed
- areal/utils/vision_dp.py — new module
- areal/models/transformers/ulyssess_patch.py — apply_vision_dp_patch() behind the vision_dp flag
- areal/api/cli_args.py — add vision_dp: bool to TrainEngineConfig
- areal/engine/fsdp_engine.py — pass vision_dp=self.config.vision_dp to apply_monkey_patch()
- areal/tests/test_vision_dp.py — unit tests

Usage
--train_engine.vision_dp true --train_engine.ulysses_sp_size 2

Precision Alignment (verl reference experiment)
Validated in verl-project/verl#5230 under controlled conditions (same algorithm shared across frameworks):
- all_reduce(SUM) changes FP accumulation order in the vision backward

Adapted from verl PR #5230.
🤖 Generated with Claude Code