Summary
The Triton fused-MoE expert kernel (invoke_fused_moe_kernel, the dominant decode-time MoE GPU consumer for Qwen3-30B-A3B and similar models) reaches the torch trace as a pybind built-in (sglang_profiler::fused_moe_triton_kernels_invoke_fused_moe_kernel_427) whose top-level kernel event carries no resolvable Input Dims. As a result:
- {category}_ops.csv (e.g. moe_fused_ops.csv) and moe_fused_metrics.json::operations[] carry an empty Input Dims / args, so the kernel cannot be roofline'd (efficiency_percent is null) and cannot be shape-anchored.
- The rendered analysis.md P-item for this kernel has an empty Args column.
- Downstream consumers that require trace-anchored input shapes (e.g. the internal kernel-opt dispatch gate) reject the kernel with an empty_kernel_shape error before any optimization harness is built.
This is the shape-capture half of the fused-MoE gap. It shares a root cause with the empty-Input Dims issues behind #726 / #727 (#727 added the perf model + surfaced the dominant kernel as a non-quantifiable P-item when its roofline is unresolved; this is the remaining half that recovers the operand shapes).
Root cause / where the dims actually live
TraceLens still captures the operands for this kernel, just not on the dimensionless built-in event. The wrapped invocation is recorded per shape in perf_report_csvs/ops_unique_args.csv, keyed by the embedded invoke_fused_moe_kernel symbol, with the two grouped-GEMM operand sets:
- gate/up GEMM: A(num_tokens, H) x w1(E, 2*I, H) -> C(T, 2*I) → (15360,2048), (128,1536,2048), (122880,1536) (bf16)
- down GEMM: A(T, I) x w2(E, H, I) -> C(num_tokens, topk, H) → (122880,768), (128,2048,768), (15360,8,2048) (bf16)
Here T = num_tokens x topk = 15360 x 8 = 122880. (Qwen3-30B-A3B MoE: E=128, top-8, H=2048, I=768; concurrency 64, ISL/OSL 1024.) These match the shapes a hand-written fused-MoE GEAK harness used to reach a validated 1.19x.
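As a sanity check, the two operand sets above can be re-derived from the model parameters alone. The following is a minimal sketch, not TraceLens code: MoEConfig, grouped_gemm_shapes and gemm_flops are hypothetical names introduced only for illustration, and the FLOP count is the usual 2*M*N*K dense-GEMM estimate, which is an assumption and not necessarily what the perf model from #727 uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical container for the parameters quoted in the text."""
    num_tokens: int    # rows entering the MoE layer
    num_experts: int   # E
    topk: int          # experts selected per token
    hidden: int        # H
    intermediate: int  # I


def grouped_gemm_shapes(cfg: MoEConfig) -> dict:
    """Return (A, W, C) shapes for the gate/up and down grouped GEMMs."""
    t = cfg.num_tokens * cfg.topk  # T: one row per (token, selected expert)
    gate_up = (
        (cfg.num_tokens, cfg.hidden),                          # A(num_tokens, H)
        (cfg.num_experts, 2 * cfg.intermediate, cfg.hidden),   # w1(E, 2*I, H)
        (t, 2 * cfg.intermediate),                             # C(T, 2*I)
    )
    down = (
        (t, cfg.intermediate),                                 # A(T, I)
        (cfg.num_experts, cfg.hidden, cfg.intermediate),       # w2(E, H, I)
        (cfg.num_tokens, cfg.topk, cfg.hidden),                # C(num_tokens, topk, H)
    )
    return {"gate_up": gate_up, "down": down}


def gemm_flops(cfg: MoEConfig) -> dict:
    """Assumed 2*M*N*K estimate per grouped GEMM (M = T routed rows)."""
    t = cfg.num_tokens * cfg.topk
    return {
        "gate_up": 2 * t * (2 * cfg.intermediate) * cfg.hidden,
        "down": 2 * t * cfg.hidden * cfg.intermediate,
    }


# Values taken from the text: Qwen3-30B-A3B, 15360 tokens in the captured step.
QWEN3_30B_A3B = MoEConfig(num_tokens=15360, num_experts=128, topk=8,
                          hidden=2048, intermediate=768)

if __name__ == "__main__":
    for name, shapes in grouped_gemm_shapes(QWEN3_30B_A3B).items():
        print(name, shapes)
```

Running it with the values above reproduces both shape triples, which is a cheap way to confirm that a recovered ops_unique_args.csv row belongs to the expected model configuration.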
Fix
Recover the fused-MoE expert kernel's operand shapes from ops_unique_args.csv and render them into the operation's args (the same format_args rendering the resolved path uses) when the kernel's own Input Dims are empty. Scoped to the invoke_fused_moe_kernel op pattern so other kernels are untouched. See PR #727 (extended).
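The fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the PR: the column names "name" and "Input Dims", the helper recover_input_dims, and the sample rows are all hypothetical, and the real ops_unique_args.csv layout and format_args rendering may differ.

```python
import csv
import io

MOE_SYMBOL = "invoke_fused_moe_kernel"  # op pattern the fallback is scoped to

# Hypothetical excerpt of ops_unique_args.csv (column names are assumptions).
SAMPLE_CSV = "\n".join([
    "name,Input Dims",
    'invoke_fused_moe_kernel,"((15360, 2048), (128, 1536, 2048), (122880, 1536))"',
    'invoke_fused_moe_kernel,"((122880, 768), (128, 2048, 768), (15360, 8, 2048))"',
    'aten::mm,"((64, 2048), (2048, 2048))"',
])


def recover_input_dims(op_name, own_dims, unique_args_rows,
                       name_col="name", dims_col="Input Dims"):
    """Return the op's own dims if present, else dims recovered per shape.

    The fallback only fires for ops whose name embeds MOE_SYMBOL, so every
    other kernel keeps its existing (possibly empty) value.
    """
    if own_dims and own_dims.strip() not in ("", "[]", "()"):
        return [own_dims]   # resolved path: nothing to recover
    if MOE_SYMBOL not in op_name:
        return []           # out of scope: leave untouched
    return [
        row[dims_col]
        for row in unique_args_rows
        if MOE_SYMBOL in row.get(name_col, "") and row.get(dims_col, "").strip()
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
builtin = "sglang_profiler::fused_moe_triton_kernels_invoke_fused_moe_kernel_427"

if __name__ == "__main__":
    for dims in recover_input_dims(builtin, "", rows):
        print(dims)
```

Matching on the embedded symbol, not on the full built-in name, keeps the lookup independent of the numeric suffix (_427 here) carried by the built-in event name.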
Cross-refs
- other category → no optimization candidate #726, Roofline the Triton fused-MoE expert GEMM (invoke_fused_moe_kernel) #727 (same empty-Input Dims root)
- ops_unique_args.csv so the kernel-opt dispatch gate (_validate_kernel_shape_and_paths) passes with shape_provenance=torch_trace.