fix(source-resolution): emit device .cu for @compile_ops/torch.ops launchers so high-E2E editable kernels reach the optimizer

## Summary
High-E2E, genuinely-editable GPU kernels are dropped before optimization because TraceLens resolves each kernel's `source_path` to its **Python launcher / dispatch stub** instead of the rewritable device source. Hyperloom's patchability gate then (correctly) classifies the launcher path as "source not under a reusable framework root" and skips the kernel — so it never reaches GEAK.

This is SEPARATE from the SGLang splitter issue (#733). Proof: across the fleet these kernels are skipped **with full shapes/phase present** — i.e. they came through a HEALTHY split + analyzer and were dropped only at source resolution. The splitter fix does not address this.

## Evidence (from a real, healthy `analysis.md`)
From the attached `analysis_AFTER_patched.md` (gpt-oss-120B, healthy split, shapes present), TraceLens attributes compute kernels to their **Python launchers**, not device source:

```
| aiter::add_rmsnorm | (16384,2880) bf16 ... | ops/rmsnorm.py(76): rmsnorm2d_fwd_with_add | 55.474 | ... | 73.56% of 8.0 TB/s | memory-bound |
| aten::addmm        | (16384,1024)x(1024,2880) bf16 | aiter/tuned_gemm.py(395): torch_gemm | 44.956 | ... | 55.14% of 1686 TFLOPS | compute-bound |
```

`ops/rmsnorm.py(76)` and `aiter/tuned_gemm.py(395)` are thin dispatch stubs — the actual compute lives in device `.cu` that TraceLens does not resolve/emit.

## Affected ops (observed; max %E2E)
| Kernel | max %E2E | WRONG resolved path (launcher) | CORRECT device source |
|---|---|---|---|
| aiter::fmoe_fp8_blockscale_g1u1 | 12.8% | aiter/fused_moe.py(367): fused_moestage | csrc/py_itfs_cu/asm_fmoe.cu |
| aiter::ck_moe_stage1 | 14.8% | ops/moe_op.py(522): ck_moe_stage1_fwd (@compile_ops) | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu |
| aiter::ck_moe_stage2 | 7.5% | ops/moe_op.py(579): ck_moe_stage2_fwd | (same .cu) |
| aiter::add_rmsnorm | 9.2% | ops/rmsnorm.py(76): rmsnorm2d_fwd_with_add | csrc/kernels/rmsnorm_quant_kernels.cu |
| sgl_kernel::silu_and_mul | 4.6% | /opt/venv/.../sgl_kernel/elementwise.py (torch.ops stub) | sglang/sgl-kernel/csrc/elementwise/activation.cu |
| aiter::rmsnorm | 3.6% | ops/rmsnorm.py(62): rmsnorm2d_fwd | aiter rmsnorm device source |
| aiter::gemm_a8w8_blockscale_ck | 1.0% | aiter/ops/gemm_op_a8w8.py | csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale.cu |

(`aten::mm` / `aten::_scaled_mm` / `vllm::rocm_unquantized_gemm` are correctly dropped: they are vendor BLAS calls or dispatch shims with no rewritable source, so they are out of scope.)

## Root cause
TraceLens attributes the kernel to the traced call-site (the `@compile_ops` / `torch.ops` Python launcher), a thin dispatch stub. The actual compute lives in a device `.cu` (or a JIT-compiled `.so`) that TraceLens does not resolve/emit, so `analysis.md`'s `source_path` points at non-rewritable wrapper code.

## Requested fix (same governance shape as the splitter fix: additive, single-contract)
For `@compile_ops` (aiter) and `torch.ops.<ns>.<op>` (sgl_kernel) launchers, resolve and emit the **device source** (`.cu`) + owning `repo` into the existing `source_path`/`repo` fields of `analysis.md` — no new fields, no new code paths, framework-additive (unknown launchers fall through unchanged). Hyperloom consumes the corrected `analysis.md` unchanged (single exit point preserved).
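
A minimal sketch of the additive lookup described above, not TraceLens's actual implementation. The function name `resolve_device_source`, the choice to key on the `<ns>::<op>` name, and the `repo` labels are all assumptions made for illustration; the device paths are copied from the "Affected ops" table. `aiter::rmsnorm` is left out because its device source is not pinned down above.

```python
from typing import Dict, Tuple

# op name -> (repo, device source). Paths come from the "Affected ops" table;
# the repo labels are placeholders, not confirmed TraceLens values.
_DEVICE_SOURCES: Dict[str, Tuple[str, str]] = {
    "aiter::fmoe_fp8_blockscale_g1u1": ("aiter", "csrc/py_itfs_cu/asm_fmoe.cu"),
    "aiter::ck_moe_stage1": ("aiter", "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu"),
    "aiter::ck_moe_stage2": ("aiter", "csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages.cu"),
    "aiter::add_rmsnorm": ("aiter", "csrc/kernels/rmsnorm_quant_kernels.cu"),
    "sgl_kernel::silu_and_mul": ("sglang", "sglang/sgl-kernel/csrc/elementwise/activation.cu"),
    "aiter::gemm_a8w8_blockscale_ck": ("aiter", "csrc/ck_gemm_a8w8_blockscale/gemm_a8w8_blockscale.cu"),
}


def resolve_device_source(op_name: str, repo: str, source_path: str) -> Tuple[str, str]:
    """Return (repo, source_path), promoted to the device source when the op is known.

    Unknown ops fall through unchanged, which keeps the change additive.
    """
    return _DEVICE_SOURCES.get(op_name, (repo, source_path))
```

Because the result is written back into the existing `repo`/`source_path` fields, a consumer such as Hyperloom would not need to change; ops that are not in the table keep whatever path was resolved before.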

## Reproduction
1. Run TraceLens analysis on a healthy SGLang gpt-oss-120B trace (this requires #733's splitter fix so that the split succeeds; the attached `analysis_AFTER_patched.md` is one such output).
2. Grep the per-kernel tables for the `source_path`/launcher column:
   `grep -E 'rmsnorm.py\(|tuned_gemm.py\(|fused_moe.py\(|moe_op.py\(' analysis.md`
3. Observe every high-%E2E aiter/sgl_kernel op resolves to a `.py` launcher, never a `.cu`.
4. Feed that `analysis.md` to Hyperloom's patchability gate → these kernels are skipped as "source not under a reusable framework root" despite having full shapes/phase.
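
Steps 2 and 3 can also be scripted. The sketch below is a hypothetical helper (`launcher_rows` is not part of TraceLens or Hyperloom) that assumes the per-kernel tables are pipe-delimited markdown rows with the op name in the first cell, as in the Evidence excerpt. It only flags call-sites of the form `file.py(line)`, so launchers recorded without a line number are not caught.

```python
import re
from typing import Iterator, Tuple

# Matches a Python call-site such as "ops/rmsnorm.py(76)".
LAUNCHER_RE = re.compile(r"[\w./-]+\.py\(\d+\)")


def launcher_rows(markdown: str) -> Iterator[Tuple[str, str]]:
    """Yield (op_name, source_cell) for table rows that resolve to a .py call-site."""
    for line in markdown.splitlines():
        if not line.lstrip().startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        for cell in cells[1:]:
            if LAUNCHER_RE.search(cell):
                yield cells[0], cell
                break  # report each row at most once
```

Running `list(launcher_rows(open("analysis.md").read()))` on an affected report should list every op that still points at a launcher; an empty list for the aiter/sgl_kernel ops would indicate the fix has taken effect.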

## Second-order note (.so trap; Hyperloom-side, for awareness)
Even with the correct `.cu`, a patch only affects runtime if the patched source is what executes — i.e. an editable/JIT path, not a prebuilt wheel `.so`. aiter is editable+JIT (rebuild-effective). sgl_kernel currently loads a prebuilt wheel `common_ops.so`, so `silu_and_mul` would need an editable sgl-kernel build for a patch to take effect. Handled Hyperloom-side as a dispatch precondition; noted here only so the source-path fix isn't mistaken for sufficient on sgl_kernel.
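
One way such a precondition could be approximated is a path heuristic on the loaded extension binary. This is only a sketch under stated assumptions, not Hyperloom's actual check: `looks_prebuilt` is a hypothetical name, and it assumes that a wheel-installed `.so` lives under a `site-packages` or `dist-packages` directory while an editable or JIT build lives elsewhere.

```python
from pathlib import PurePosixPath

# Directory names that typically hold wheel-installed packages.
_INSTALLED_DIRS = {"site-packages", "dist-packages"}


def looks_prebuilt(binary_path: str) -> bool:
    """Heuristic: True if an extension .so sits inside an installed-packages directory."""
    parts = PurePosixPath(binary_path).parts
    return binary_path.endswith(".so") and any(p in _INSTALLED_DIRS for p in parts)
```

A kernel whose binary looks prebuilt would then be held back until an editable build is available, since patching its `.cu` would not change what executes.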

## Cleanup to coordinate
A source-promotion shim currently exists in Hyperloom (`tracelens_analysis.py: upgrade_aiter_compile_ops_launcher`). That is the kind of Hyperloom-side recovery the team is moving away from; once TraceLens emits the correct `source_path`, that shim should be removed (logic owned by TraceLens).

## Relationship
Companion to the SGLang splitter issue #733 (the recognizer/steady-state fix). That one is upstream (whether a good `analysis.md` is produced); this one is downstream (whether a correctly-characterized kernel reaches the optimizer). Independent and complementary.

