fix(source-resolution): emit device .cu for @compile_ops/torch.ops launchers so high-E2E editable kernels reach the optimizer #734

@iraj465

Summary

High-E2E, genuinely-editable GPU kernels are dropped before optimization because TraceLens resolves each kernel's source_path to its Python launcher / dispatch stub instead of the rewritable device source. Hyperloom's patchability gate then (correctly) classifies the launcher path as "source not under a reusable framework root" and skips the kernel — so it never reaches GEAK.

This is SEPARATE from the SGLang splitter issue (#733). Proof: across the fleet these kernels are skipped with full shapes/phase present — i.e. they came through a HEALTHY split + analyzer and were dropped only at source resolution. The splitter fix does not address this.

Evidence (from a real, healthy analysis.md)

From the attached analysis_AFTER_patched.md (gpt-oss-120B, healthy split, shapes present), TraceLens attributes compute kernels to their Python launchers, not device source:

| aiter::add_rmsnorm | (16384,2880) bf16 ... | ops/rmsnorm.py(76): rmsnorm2d_fwd_with_add | 55.474 | ... | 73.56% of 8.0 TB/s | memory-bound |
| aten::addmm        | (16384,1024)x(1024,2880) bf16 | aiter/tuned_gemm.py(395): torch_gemm | 44.956 | ... | 55.14% of 1686 TFLOPS | compute-bound |

ops/rmsnorm.py(76) and aiter/tuned_gemm.py(395) are thin dispatch stubs — the actual compute lives in device .cu that TraceLens does not resolve/emit.

Affected ops (observed; max %E2E)

| Kernel | max %E2E | WRONG resolved path (launcher) | CORRECT device source |
| --- | --- | --- | --- |
| aiter::fmoe_fp8_blockscale_g1u1 | 12.8% | aiter/fused_moe.py(367): fused_moestage | csrc/py_itfs_cu/asm_fmoe.cu |
| aiter::ck_moe_stage1 | 14.8% | ops/moe_op.py(522): ck_moe_stage1_fwd (@compile_ops) | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu |
| aiter::ck_moe_stage2 | 7.5% | ops/moe_op.py(579): ck_moe_stage2_fwd | (same .cu) |
| aiter::add_rmsnorm | 9.2% | ops/rmsnorm.py(76): rmsnorm2d_fwd_with_add | csrc/kernels/rmsnorm_quant_kernels.cu |
| sgl_kernel::silu_and_mul | 4.6% | /opt/venv/.../sgl_kernel/elementwise.py (torch.ops stub) | sglang/sgl-kernel/csrc/elementwise/activation.cu |
| aiter::rmsnorm | 3.6% | ops/rmsnorm.py(62): rmsnorm2d_fwd | aiter rmsnorm device source |
| aiter::gemm_a8w8_blockscale_ck | 1.0% | aiter/ops/gemm_op_a8w8.py | csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale.cu |

(aten::mm / aten::_scaled_mm / vllm::rocm_unquantized_gemm correctly dropped — vendor BLAS / dispatch shims, no rewritable source. Not in scope.)

Root cause

TraceLens attributes the kernel to the traced call-site (the @compile_ops / torch.ops Python launcher), a thin dispatch stub. The actual compute lives in a device .cu (or a JIT-compiled .so) that TraceLens does not resolve/emit, so analysis.md's source_path points at non-rewritable wrapper code.

Requested fix (same governance shape as the splitter fix: additive, single-contract)

For @compile_ops (aiter) and torch.ops.<ns>.<op> (sgl_kernel) launchers, resolve and emit the device source (.cu) + owning repo into the existing source_path/repo fields of analysis.md — no new fields, no new code paths, framework-additive (unknown launchers fall through unchanged). Hyperloom consumes the corrected analysis.md unchanged (single exit point preserved).
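
A minimal sketch of the additive, fall-through shape described above, not TraceLens's actual implementation. The function name `resolve_source`, the `DeviceSource` type, and the lookup table are hypothetical; the `.cu` paths are copied from the "Affected ops" table, and the repo labels (`aiter`, `sglang`) are assumptions about how the owning repo would be named.

```python
from typing import NamedTuple, Optional, Tuple


class DeviceSource(NamedTuple):
    repo: str  # owning repo label (assumed naming)
    path: str  # rewritable device source, relative to that repo


# Hypothetical table keyed by op name; paths taken from the "Affected ops" table.
LAUNCHER_TO_DEVICE = {
    "aiter::fmoe_fp8_blockscale_g1u1": DeviceSource(
        "aiter", "csrc/py_itfs_cu/asm_fmoe.cu"),
    "aiter::ck_moe_stage1": DeviceSource(
        "aiter", "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu"),
    "aiter::ck_moe_stage2": DeviceSource(
        "aiter", "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu"),
    "aiter::add_rmsnorm": DeviceSource(
        "aiter", "csrc/kernels/rmsnorm_quant_kernels.cu"),
    "sgl_kernel::silu_and_mul": DeviceSource(
        "sglang", "sglang/sgl-kernel/csrc/elementwise/activation.cu"),
}


def resolve_source(
    op_name: str,
    launcher_path: str,
    launcher_repo: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (source_path, repo) for the existing analysis.md fields.

    Known launchers are promoted to their device source; anything else
    falls through unchanged, so the change stays framework-additive.
    """
    hit = LAUNCHER_TO_DEVICE.get(op_name)
    if hit is None:
        return launcher_path, launcher_repo
    return hit.path, hit.repo
```

With this shape, `aten::mm` and other out-of-scope ops keep whatever path they had, and only the listed launchers change what lands in `source_path`/`repo`.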

Reproduction

  1. Run TraceLens analysis on a healthy SGLang gpt-oss-120B trace (requires the splitter fix from #733 so the split succeeds; the attached analysis_AFTER_patched.md is one such output).
  2. Grep the per-kernel tables for source_path/launcher column:
    grep -E 'rmsnorm.py\(|tuned_gemm.py\(|fused_moe.py\(|moe_op.py\(' analysis.md
  3. Observe every high-%E2E aiter/sgl_kernel op resolves to a .py launcher, never a .cu.
  4. Feed that analysis.md to Hyperloom's patchability gate → these kernels are skipped as "source not under a reusable framework root" despite having full shapes/phase.
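
Steps 2 and 3 can also be scripted. The sketch below is a hypothetical helper, not part of TraceLens or Hyperloom: it assumes the per-kernel tables are pipe-delimited with the op name in the first cell, as in the two rows quoted under Evidence, and flags every row whose source cell looks like `file.py(line)`.

```python
import re

# Matches a Python call-site such as "ops/rmsnorm.py(76)".
LAUNCHER_RE = re.compile(r"[\w./-]+\.py\(\d+\)")


def find_launcher_rows(markdown_text):
    """Return (op_name, launcher) for table rows that resolve to a .py launcher."""
    hits = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        for cell in cells[1:]:
            match = LAUNCHER_RE.search(cell)
            if match:
                hits.append((cells[0], match.group(0)))
                break
    return hits


# Sample rows shaped like the Evidence excerpt (elided columns dropped).
sample = """\
| aiter::add_rmsnorm | (16384,2880) bf16 | ops/rmsnorm.py(76): rmsnorm2d_fwd_with_add | 55.474 |
| aten::addmm | (16384,1024)x(1024,2880) bf16 | aiter/tuned_gemm.py(395): torch_gemm | 44.956 |
"""

for op, launcher in find_launcher_rows(sample):
    print(f"{op} -> {launcher}")
```

Running it on a real analysis.md would mean passing the file's text in place of `sample`; any op that appears in the output is one the patchability gate would currently skip.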

Second-order note (.so trap; Hyperloom-side, for awareness)

Even with the correct .cu, a patch only affects runtime if the patched source is what executes — i.e. an editable/JIT path, not a prebuilt wheel .so. aiter is editable+JIT (rebuild-effective). sgl_kernel currently loads a prebuilt wheel common_ops.so, so silu_and_mul would need an editable sgl-kernel build for a patch to take effect. Handled Hyperloom-side as a dispatch precondition; noted here only so the source-path fix isn't mistaken for sufficient on sgl_kernel.
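
One rough way to express that precondition is a heuristic on where the package is loaded from. This is a sketch only and not Hyperloom's actual check: `looks_prebuilt` and `module_origin` are hypothetical names, and the rule "anything under site-packages or dist-packages is a prebuilt wheel" is an assumption that will misclassify some setups.

```python
import importlib.util
from pathlib import PurePosixPath
from typing import Optional


def module_origin(top_level_name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, if any."""
    spec = importlib.util.find_spec(top_level_name)
    return spec.origin if spec is not None else None


def looks_prebuilt(origin: Optional[str]) -> bool:
    """Heuristic (assumption): installed wheels live under site-packages or
    dist-packages, while editable checkouts resolve to the source tree."""
    if not origin:
        return False
    parts = PurePosixPath(origin).parts
    return "site-packages" in parts or "dist-packages" in parts


# Illustrative paths only; real locations depend on the environment.
print(looks_prebuilt("/opt/venv/lib/python3.10/site-packages/sgl_kernel/elementwise.py"))
print(looks_prebuilt("/workspace/aiter/aiter/ops/rmsnorm.py"))
```

A gate built this way would call `looks_prebuilt(module_origin("sgl_kernel"))` before dispatching a patch and skip the kernel when it returns true.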

Cleanup to coordinate

A source-promotion shim currently exists in Hyperloom (tracelens_analysis.py: upgrade_aiter_compile_ops_launcher). That is the kind of Hyperloom-side recovery the team is moving away from; once TraceLens emits the correct source_path, that shim should be removed (logic owned by TraceLens).

Relationship

Companion to the SGLang splitter issue #733 (the recognizer/steady-state fix). That one is upstream (whether a good analysis.md is produced); this one is downstream (whether a correctly-characterized kernel reaches the optimizer). Independent and complementary.
