title: fix(source-resolution): emit device .cu for @compile_ops/torch.ops launchers so high-E2E editable kernels reach the optimizer
Summary
High-E2E, genuinely-editable GPU kernels are dropped before optimization because TraceLens resolves each kernel's source_path to its Python launcher / dispatch stub instead of the rewritable device source. Hyperloom's patchability gate then (correctly) classifies the launcher path as "source not under a reusable framework root" and skips the kernel — so it never reaches GEAK.
This is SEPARATE from the SGLang splitter issue (#733). Proof: across the fleet these kernels are skipped with full shapes/phase present — i.e. they came through a HEALTHY split + analyzer and were dropped only at source resolution. The splitter fix does not address this.
Evidence (from a real, healthy analysis.md)
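From the attached analysis_AFTER_patched.md (gpt-oss-120B, healthy split, shapes present), TraceLens attributes compute kernels to their Python launchers, not device source. Both ops/rmsnorm.py(76) and aiter/tuned_gemm.py(395) are thin dispatch stubs; the actual compute lives in device .cu that TraceLens does not resolve/emit.
Affected ops (observed; max %E2E)
(aten::mm / aten::_scaled_mm / vllm::rocm_unquantized_gemm are correctly dropped: they are vendor BLAS / dispatch shims with no rewritable source. Not in scope.)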
Root cause
TraceLens attributes the kernel to the traced call-site (the @compile_ops / torch.ops Python launcher), a thin dispatch stub. The actual compute lives in a device .cu (or a JIT-compiled .so) that TraceLens does not resolve/emit, so analysis.md's source_path points at non-rewritable wrapper code.
Requested fix (same governance shape as the splitter fix: additive, single-contract)
For @compile_ops (aiter) and torch.ops.<ns>.<op> (sgl_kernel) launchers, resolve and emit the device source (.cu) + owning repo into the existing source_path/repo fields of analysis.md — no new fields, no new code paths, framework-additive (unknown launchers fall through unchanged). Hyperloom consumes the corrected analysis.md unchanged (single exit point preserved).
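A minimal sketch of the additive shape requested above, assuming a lookup keyed on the launcher file name. `KernelRow`, `DEVICE_SOURCES`, and the `PLACEHOLDER/...` paths are hypothetical and only illustrate the contract (same two fields rewritten, unknown launchers returned unchanged); they are not TraceLens's actual data model or real `.cu` locations.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KernelRow:
    """One per-kernel row of analysis.md (only the fields touched here)."""
    op: str
    source_path: str
    repo: Optional[str] = None


# HYPOTHETICAL registry: launcher file name -> (device source, owning repo).
# The .cu paths are placeholders, not real locations inside aiter.
DEVICE_SOURCES = {
    "rmsnorm.py": ("PLACEHOLDER/rmsnorm_device.cu", "aiter"),
    "tuned_gemm.py": ("PLACEHOLDER/gemm_device.cu", "aiter"),
}


def launcher_file(source_path: str) -> str:
    """Extract 'file.py' from a 'dir/file.py(line)' style source_path."""
    path = source_path.split("(", 1)[0]
    return path.rsplit("/", 1)[-1]


def resolve_device_source(row: KernelRow) -> KernelRow:
    """Rewrite source_path/repo for known launchers; pass others through."""
    hit = DEVICE_SOURCES.get(launcher_file(row.source_path))
    if hit is None:
        return row  # unknown launcher: fall through unchanged
    device_path, repo = hit
    return replace(row, source_path=device_path, repo=repo)
```

Keying on the bare file name is a simplification for the sketch; a real resolver would likely need the op namespace or the full launcher path to avoid collisions between frameworks.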
Reproduction
1. Start from an analysis.md produced by a healthy split (analysis_AFTER_patched.md is one such output).
2. Grep the per-kernel tables for the source_path/launcher column: grep -E 'rmsnorm\.py\(|tuned_gemm\.py\(|fused_moe\.py\(|moe_op\.py\(' analysis.md
3. Observe that every high-%E2E aiter/sgl_kernel op resolves to a .py launcher, never a .cu.
4. Feed that analysis.md to Hyperloom's patchability gate; these kernels are skipped as "source not under a reusable framework root" despite having full shapes/phase.
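For scripted checks, the grep above can be approximated in Python. This is a sketch only: it assumes launchers appear as `file.py(line)` somewhere on a table row, and it is slightly stricter than the grep because it also requires the line number and closing parenthesis. The sample rows in the usage note are made up for illustration.

```python
import re

# Same launcher file names as the grep in the reproduction steps.
LAUNCHER_RE = re.compile(r"(?:rmsnorm|tuned_gemm|fused_moe|moe_op)\.py\(\d+\)")


def launcher_hits(markdown_text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, matched launcher) for each matching line."""
    hits = []
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        match = LAUNCHER_RE.search(line)
        if match:
            hits.append((lineno, match.group(0)))
    return hits
```

Usage, on hypothetical rows: `launcher_hits("| rmsnorm | ops/rmsnorm.py(76) |\n| other | kernels/foo.cu |")` reports only the first line, since the second row already points at a `.cu`.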
Second-order note (.so trap; Hyperloom-side, for awareness)
Even with the correct .cu, a patch only affects runtime if the patched source is what executes — i.e. an editable/JIT path, not a prebuilt wheel .so. aiter is editable+JIT (rebuild-effective). sgl_kernel currently loads a prebuilt wheel common_ops.so, so silu_and_mul would need an editable sgl-kernel build for a patch to take effect. Handled Hyperloom-side as a dispatch precondition; noted here only so the source-path fix isn't mistaken for sufficient on sgl_kernel.
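One rough way to flag the .so trap before dispatch is to look at where the loaded extension lives. The heuristic below is an assumption for illustration, not Hyperloom's actual precondition: it treats a `.so` under `site-packages` or `dist-packages` as coming from a prebuilt wheel, and anything else as potentially rebuild-effective. The example paths in the usage note are hypothetical.

```python
from pathlib import PurePosixPath

WHEEL_DIRS = ("site-packages", "dist-packages")


def looks_prebuilt(module_origin: str) -> bool:
    """Heuristic: a .so installed under site-packages is likely a wheel binary."""
    path = PurePosixPath(module_origin)
    under_wheel_dir = any(part in WHEEL_DIRS for part in path.parts)
    return under_wheel_dir and path.suffix == ".so"
```

In practice the origin could be taken from `importlib.util.find_spec(<extension module name>).origin`. For example, `looks_prebuilt("/usr/lib/python3.10/site-packages/sgl_kernel/common_ops.so")` is true, while a `.so` built inside a source checkout is not flagged. An editable install that still ships a stale binary would slip through, so this is a first filter only.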
Cleanup to coordinate
A source-promotion shim currently exists in Hyperloom (tracelens_analysis.py: upgrade_aiter_compile_ops_launcher). That is the kind of Hyperloom-side recovery the team is moving away from; once TraceLens emits the correct source_path, that shim should be removed (logic owned by TraceLens).
Relationship
Companion to the SGLang splitter issue #733 (the recognizer/steady-state fix). That one is upstream (whether a good analysis.md is produced); this one is downstream (whether a correctly-characterized kernel reaches the optimizer). Independent and complementary.