[None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic quantization #11759
Conversation
📝 Walkthrough: A namespace path is updated for a per-tensor quantization operator in the linear module.
…uantization

The C++ op tensorrt_llm::quantize_e4m3_per_tensor (registered via TORCH_LIBRARY_FRAGMENT in fp8Op.cpp) lacks a register_fake implementation. Without register_fake, torch.compile's Dynamo tracer cannot infer output shape/dtype metadata, causing a graph break at every dynamic quantization call.

Add register_fake for tensorrt_llm::quantize_e4m3_per_tensor in cpp_custom_ops.py, matching the pattern already used for the static variant (static_quantize_e4m3_per_tensor).

Impact on FLUX.2 (B200, 1024x1024, 50 steps, torch.compile):

- Before: 36 subgraphs, 491 traced nodes (~8% compile coverage)
- After: 1 subgraph, 6,431 traced nodes (full compile coverage)

No latency change observed (GEMMs dominate runtime), but the fix produces a correct monolithic FX graph that enables future Inductor optimizations requiring whole-graph visibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
Force-pushed from e9829e3 to 3e67458 (Compare)
/bot run
PR_Github #36981 [ run ] triggered by Bot. Commit:
PR_Github #36981 [ run ] completed with state
Merged as discussed with @liji-nv ~ thank you!
…uantization (NVIDIA#11759)

Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary

The tensorrt_llm::quantize_e4m3_per_tensor op lacks a register_fake implementation. Without register_fake, torch.compile's Dynamo tracer cannot infer output shape/dtype metadata, causing a graph break at every dynamic quantization call.

Added register_fake for tensorrt_llm::quantize_e4m3_per_tensor in cpp_custom_ops.py, matching the pattern already used for the static variant (static_quantize_e4m3_per_tensor).

Test plan
Observation

Impact on FLUX.2 (B200, 1024x1024, 50 steps, torch.compile):

- Before: 36 subgraphs, 491 traced nodes (~8% compile coverage)
- After: 1 subgraph, 6,431 traced nodes (full compile coverage)

No latency change observed (GEMMs dominate runtime), but the fix produces a correct monolithic FX graph that enables future Inductor optimizations requiring whole-graph visibility.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.