[None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic quantization#11759

Merged
karljang merged 2 commits into NVIDIA:main from karljang:user/kanghwan/fix-fp8-per-tensor-graph-break on Mar 2, 2026
Conversation

@karljang
Collaborator

@karljang karljang commented Feb 26, 2026

Summary

The C++ op tensorrt_llm::quantize_e4m3_per_tensor lacks a register_fake implementation. Without one, torch.compile's Dynamo tracer cannot infer output shape/dtype metadata, causing a graph break at every dynamic quantization call.

Added register_fake for tensorrt_llm::quantize_e4m3_per_tensor in cpp_custom_ops.py, matching the pattern already used for the static variant (static_quantize_e4m3_per_tensor).

Test plan

  • Verify FP8 per-tensor eager mode produces identical outputs (same kernel)
  • Verify torch.compile produces single monolithic FX graph
  • Run existing FP8 per-tensor unit tests

Observation

Impact on FLUX.2 (B200, 1024x1024, 50 steps, torch.compile):

  • Before: 36 subgraphs, 491 traced nodes (~8% compile coverage)
  • After: 1 subgraph, 6,431 traced nodes (full compile coverage)

No latency change observed (GEMMs dominate runtime), but the fix produces a correct monolithic FX graph that enables future Inductor optimizations requiring whole-graph visibility.

Summary by CodeRabbit

  • Chores
    • Internal update to quantization operation namespace path. No changes to user-facing functionality or behavior.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@karljang karljang requested a review from a team as a code owner February 26, 2026 22:27
@karljang karljang requested a review from yuxianq February 26, 2026 22:27
@coderabbitai
Contributor

coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

A namespace path is updated for a per-tensor quantization operator in the linear module. The call changes from torch.ops.tensorrt_llm.quantize_e4m3_per_tensor to torch.ops.trtllm.quantize_e4m3_per_tensor, with arguments and return values remaining unchanged.

Changes

Cohort / File(s) Summary
Quantization Operator Namespace Update
tensorrt_llm/_torch/modules/linear.py
Updated operator namespace from torch.ops.tensorrt_llm to torch.ops.trtllm for the quantize_e4m3_per_tensor call during dynamic quantization.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title clearly describes the main change: fixing FP8 per-tensor torch.compile graph fragmentation in dynamic quantization.
  • Description check ✅ Passed — The PR description clearly explains the issue, solution, and test coverage, with specific metrics demonstrating the fix's impact.


…uantization

The C++ op tensorrt_llm::quantize_e4m3_per_tensor (registered via
TORCH_LIBRARY_FRAGMENT in fp8Op.cpp) lacks a register_fake
implementation. Without register_fake, torch.compile's Dynamo tracer
cannot infer output shape/dtype metadata, causing a graph break at
every dynamic quantization call.

Add register_fake for tensorrt_llm::quantize_e4m3_per_tensor in
cpp_custom_ops.py, matching the pattern already used for the static
variant (static_quantize_e4m3_per_tensor).

Impact on FLUX.2 (B200, 1024x1024, 50 steps, torch.compile):
- Before: 36 subgraphs, 491 traced nodes (~8% compile coverage)
- After: 1 subgraph, 6,431 traced nodes (full compile coverage)

No latency change observed (GEMMs dominate runtime), but the fix
produces a correct monolithic FX graph that enables future Inductor
optimizations requiring whole-graph visibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
@karljang karljang force-pushed the user/kanghwan/fix-fp8-per-tensor-graph-break branch from e9829e3 to 3e67458 Compare February 26, 2026 22:58
@karljang karljang requested a review from a team as a code owner February 26, 2026 22:58
@karljang karljang requested a review from liji-nv February 26, 2026 22:58
@karljang
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #36981 [ run ] triggered by Bot. Commit: 2bdd317

@karljang karljang requested a review from yuxianq February 27, 2026 04:17
@tensorrt-cicd
Collaborator

PR_Github #36981 [ run ] completed with state SUCCESS. Commit: 2bdd317
/LLM/main/L0_MergeRequest_PR pipeline #28634 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.


@karljang karljang merged commit 9013b58 into NVIDIA:main Mar 2, 2026
7 checks passed
@karljang
Collaborator Author

karljang commented Mar 2, 2026

Merged as discussed with @liji-nv ~ thank you!

@karljang karljang deleted the user/kanghwan/fix-fp8-per-tensor-graph-break branch March 2, 2026 08:27
greg-kwasniewski1 pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Mar 2, 2026
…uantization (NVIDIA#11759)

Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026
…uantization (NVIDIA#11759)

Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026
…uantization (NVIDIA#11759)

Signed-off-by: Kanghwan Jang <861393+karljang@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>