cudaGraphLaunch-launched kernels sometimes missing from critical path analysis #290

@LSaga

🐛 Describe the bug

When analyzing inference traces, we observed cases where kernels launched via cudaGraphLaunch are missing from the critical path, even though they account for the bulk of model computation. Because these kernels typically perform the core forward-pass operations, they should appear on the critical path whenever they affect end-to-end latency.

Steps to reproduce

  1. Run inference with CUDA Graph enabled (e.g., via torch.cuda.graph() or similar in frameworks like sglang).

  2. Collect a full trace that includes the host-side cudaGraphLaunch and its corresponding GPU kernel executions.

  3. Analyze the trace using HTA.

  4. Observe that the critical path does not include the kernels launched by cudaGraphLaunch, even when they take up most of the GPU time.
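To help confirm step 4 on a given trace, here is a minimal sketch that estimates how much GPU kernel time comes from cudaGraphLaunch. It is not part of HTA. It assumes the trace is in the Chrome trace event format, with runtime calls tagged `cat: "cuda_runtime"`, kernels tagged `cat: "kernel"`, and graph-launched kernels sharing the `args.correlation` ID of their cudaGraphLaunch call. The function name `graph_kernel_share` and the sample events are made up for illustration.

```python
from collections import defaultdict


def graph_kernel_share(events):
    """Return (graph_kernel_time, total_kernel_time, kernels_by_launch).

    Assumes each event is a dict with "cat", "name", "dur" and an
    "args" dict that may carry a "correlation" ID.
    """
    # Correlation IDs of host-side cudaGraphLaunch calls.
    launch_corrs = {
        e["args"]["correlation"]
        for e in events
        if e.get("cat") == "cuda_runtime"
        and e.get("name") == "cudaGraphLaunch"
        and "correlation" in e.get("args", {})
    }

    kernels_by_launch = defaultdict(list)
    graph_time = 0
    total_time = 0
    for e in events:
        if e.get("cat") != "kernel":
            continue
        dur = e.get("dur", 0)
        total_time += dur
        corr = e.get("args", {}).get("correlation")
        # Assumption: a graph-launched kernel carries the correlation ID
        # of the cudaGraphLaunch call that started it.
        if corr in launch_corrs:
            graph_time += dur
            kernels_by_launch[corr].append(e["name"])
    return graph_time, total_time, dict(kernels_by_launch)


# Synthetic events, invented for illustration only (durations in microseconds).
sample_events = [
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaGraphLaunch",
     "ts": 100, "dur": 20, "args": {"correlation": 7}},
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel",
     "ts": 130, "dur": 400, "args": {"correlation": 7}},
    {"ph": "X", "cat": "kernel", "name": "softmax_kernel",
     "ts": 540, "dur": 100, "args": {"correlation": 7}},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel",
     "ts": 650, "dur": 5, "args": {"correlation": 8}},
    {"ph": "X", "cat": "kernel", "name": "copy_kernel",
     "ts": 660, "dur": 50, "args": {"correlation": 8}},
]

graph_time, total_time, by_launch = graph_kernel_share(sample_events)
print(f"graph-launched kernel time: {graph_time}/{total_time} us")
```

For a real trace, the event list would typically come from the `traceEvents` array of the exported JSON file. If the correlation assumption does not hold for a given profiler version, the grouping step needs to be adapted.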

Expected behavior

Kernels launched via cudaGraphLaunch should be considered part of the critical path if their execution contributes to the end-to-end latency. These kernels should not be skipped during analysis.
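The expectation could be phrased as a simple check, sketched below under stated assumptions. The critical path is represented here as a set of `(name, ts)` pairs, which is a hypothetical representation chosen for this example and not HTA's actual output format. The function name `critical_path_coverage` and the sample values are also illustrative.

```python
def critical_path_coverage(graph_kernels, critical_path):
    """Fraction of graph-launched kernel time that lies on the critical path.

    graph_kernels: list of dicts with "name", "ts" and "dur".
    critical_path: set of (name, ts) pairs (hypothetical representation).
    """
    total = sum(k["dur"] for k in graph_kernels)
    if total == 0:
        return 0.0
    on_path = sum(
        k["dur"] for k in graph_kernels
        if (k["name"], k["ts"]) in critical_path
    )
    return on_path / total


# Synthetic data, invented for illustration only.
graph_kernels = [
    {"name": "gemm_kernel", "ts": 130, "dur": 400},
    {"name": "softmax_kernel", "ts": 540, "dur": 100},
]

# Behavior described in this report: only the host-side launch is on the path.
reported_path = {("cudaGraphLaunch", 100)}
# Expected for a GPU-bound run: the graph-launched kernels are on the path too.
expected_path = reported_path | {("gemm_kernel", 130), ("softmax_kernel", 540)}

print(critical_path_coverage(graph_kernels, reported_path))
print(critical_path_coverage(graph_kernels, expected_path))
```

A coverage near zero on a GPU-bound trace would match the behavior reported here. A value near one is what the expected behavior implies.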

Environment

OS: Mac, OS version: macOS Sequoia, Python version: 3.11.9, HTA version: source

Additional Info

cudaGraphLaunch bug.pdf
