Description
🐛 Describe the bug
In certain cases during inference trace analysis, we observed that kernels launched via cudaGraphLaunch are not included in the critical path, even though they represent the bulk of model computation. Since these kernels typically perform core forward-pass operations, they should naturally appear on the critical path if they impact the overall latency.
Steps to reproduce
1. Run inference with CUDA Graphs enabled (e.g., via torch.cuda.graph() or a similar mechanism in frameworks such as sglang).
2. Collect a full trace that includes the host-side cudaGraphLaunch calls and their corresponding GPU kernel executions.
3. Analyze the trace using HTA.
4. Observe that the critical path does not include the kernels launched by cudaGraphLaunch, even when they account for most of the GPU time.
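The steps above can be sketched as follows. This is a minimal, hypothetical repro, assuming PyTorch with CUDA available and HTA installed from source; the model, input, and trace directory are placeholders not taken from the original report, and the HTA call follows its experimental critical-path API, which may differ across versions.

```python
# Hypothetical repro sketch (not from the original report). Assumes
# PyTorch with CUDA and HTA installed from source; `model`,
# `static_input`, and the trace directory are placeholders.
try:
    import torch
except ImportError:  # allow the sketch to load without PyTorch installed
    torch = None


def capture_trace_with_cuda_graph(model, static_input, trace_dir="./traces"):
    """Capture the forward pass in a CUDA graph, replay it under the
    PyTorch profiler, and export a Chrome trace for HTA (steps 1-2)."""
    graph = torch.cuda.CUDAGraph()

    # Warm up on a side stream before capture, as PyTorch recommends.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side)

    with torch.cuda.graph(graph):
        static_output = model(static_input)

    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ]
    ) as prof:
        graph.replay()  # host side records a single cudaGraphLaunch
        torch.cuda.synchronize()

    prof.export_chrome_trace(f"{trace_dir}/rank_0.json")
    return static_output


def analyze_critical_path(trace_dir="./traces"):
    """Run HTA's critical-path analysis on the collected trace (steps 3-4).
    The method name, arguments, and return shape follow HTA's experimental
    API and may differ between versions."""
    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir=trace_dir)
    cp_graph, success = analyzer.critical_path_analysis(
        rank=0, annotation="ProfilerStep", instance_id=0
    )
    return cp_graph if success else None
```

With this setup, the kernels executed by graph.replay() appear in the GPU stream of the exported trace, but inspecting the critical path returned by analyze_critical_path shows the gap described in this issue.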
Expected behavior
Kernels launched via cudaGraphLaunch should be considered part of the critical path if their execution contributes to the end-to-end latency. These kernels should not be skipped during analysis.
Environment
- OS: macOS Sequoia
- Python version: 3.11.9
- HTA version: built from source