[None][refactor] parallel vae refactor#12123
Conversation
Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughThis pull request refactors the VAE parallelization system by replacing the adapter-based architecture with a factory-based approach. It introduces new abstractions ( Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~65 minutes 🚥 Pre-merge checks | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/visual_gen/modules/vae/parallel_vae_interface.py (1)
127-145: Consider documenting process group lifecycle.The
_build_adj_groupscreates pairwise process groups that persist for the lifetime of the process. This is standard for distributed training, but a brief docstring note about cleanup responsibility (if any) would be helpful for maintainers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/modules/vae/parallel_vae_interface.py` around lines 127 - 145, Add a short note to the _build_adj_groups docstring documenting that the created ProcessGroup objects (returned in adj_groups) persist for the process lifetime and that callers are responsible for any explicit cleanup (if needed) or should rely on the process teardown to release them; reference the _build_adj_groups static method and the returned adj_groups/ProcessGroup objects so maintainers know where lifecycle and cleanup responsibility applies.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/visual_gen/pipeline.py`:
- Around line 204-219: The new process group created by dist.new_group is leaked
when ParallelVAEFactory.from_vae raises ValueError; ensure the group is
destroyed on failure by calling dist.destroy_process_group(pg) (or the
appropriate destroy function) inside the except ValueError block before
returning, referencing the created pg and the call to
ParallelVAEFactory.from_vae (and self.vae) so the cleanup runs only on the
failure path.
---
Nitpick comments:
In `@tensorrt_llm/_torch/visual_gen/modules/vae/parallel_vae_interface.py`:
- Around line 127-145: Add a short note to the _build_adj_groups docstring
documenting that the created ProcessGroup objects (returned in adj_groups)
persist for the process lifetime and that callers are responsible for any
explicit cleanup (if needed) or should rely on the process teardown to release
them; reference the _build_adj_groups static method and the returned
adj_groups/ProcessGroup objects so maintainers know where lifecycle and cleanup
responsibility applies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 64e3f104-2807-4a54-9bf2-4065f4f8946d
📒 Files selected for processing (8)
tensorrt_llm/_torch/visual_gen/models/wan/__init__.pytensorrt_llm/_torch/visual_gen/models/wan/parallel_vae.pytensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.pytensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.pytensorrt_llm/_torch/visual_gen/modules/vae/__init__.pytensorrt_llm/_torch/visual_gen/modules/vae/parallel_vae_interface.pytensorrt_llm/_torch/visual_gen/parallelism.pytensorrt_llm/_torch/visual_gen/pipeline.py
|
PR_Github #38631 [ run ] triggered by Bot. Commit: |
There was a problem hiding this comment.
Thanks for the effort, @NVShreyas. Just a heads-up — not sure if this refactor/design could help enable parallel vae to LTX-2 in the future..
https://github.com/Lightricks/LTX-2/blob/ae855f8538843825f9015a419cf4ba5edaf5eec2/packages/ltx-core/src/ltx_core/model/video_vae/video_vae.py#L802
|
@chang-l I see LTX2 uses tiling. For that we would need parallel VAE using tile distribution which would need more changes. But the halo exchange and refactor should hold |
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.