Checklist
Describe the bug
[multiple_iteration_timeit_from_trace](https://github.com/sgl-project/sglang-jax/blob/main/python/sgl_jax/srt/kernels/utils/perf.py#L77-L102) in `python/sgl_jax/srt/kernels/utils/perf.py` was reporting host wall time, not device kernel time. This inflated measurements 4–10× and ranked candidates by host noise.
Two compounding issues in `_extract_marker_durations_ms`:
- The intended `MARKER in args.tf_op` filter found zero events, because [jax.named_scope](https://github.com/sgl-project/sglang-jax/blob/main/python/sgl_jax/srt/kernels/utils/perf.py#L96) wrapping an already-compiled jit at runtime doesn't propagate into the HLO `op_name` (`named_scope` only injects metadata when active during tracing).
- The fallback regex match against `task` matched the host `StepTraceAnnotation` event, then `min_pid` biased the selection toward the host plane. Those events have no `device_duration_ps`, so the code read `dur`, which is host wall time including `block_until_ready`.
Does this analysis make sense? Suggestions are welcome. I plan to open a PR that searches the device plane's events directly to get more accurate kernel times.
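A rough sketch of the direction I have in mind, not the actual patch. It assumes the trace is already loaded as a list of Chrome-trace-style event dicts, that device planes can be recognized by a `process_name` metadata event whose name starts with `/device:`, and that `dur` is in microseconds. The helper names `device_pids` and `device_durations_ms` are hypothetical and do not exist in `perf.py`.

```python
def device_pids(events):
    """Collect pids whose process_name metadata looks like a device plane.

    Assumption: device planes are labelled "/device:..." in the
    "process_name" metadata events (ph == "M").
    """
    pids = set()
    for ev in events:
        if ev.get("ph") == "M" and ev.get("name") == "process_name":
            if ev.get("args", {}).get("name", "").startswith("/device:"):
                pids.add(ev["pid"])
    return pids


def device_durations_ms(events, name_substr):
    """Return durations (ms) of complete events on device planes only.

    Host-plane events are skipped entirely, so a host StepTraceAnnotation
    with the same name can never be picked up by accident.
    """
    pids = device_pids(events)
    durations = []
    for ev in events:
        if ev.get("ph") != "X" or ev.get("pid") not in pids:
            continue
        if name_substr not in ev.get("name", ""):
            continue
        args = ev.get("args", {})
        if "device_duration_ps" in args:
            # picoseconds -> milliseconds
            durations.append(float(args["device_duration_ps"]) / 1e9)
        else:
            # Assumed fallback: on a device plane, dur (microseconds)
            # is device time rather than host wall time.
            durations.append(float(ev["dur"]) / 1e3)
    return durations
```

The key design choice is to filter by plane first and by name second, instead of matching by name and then picking the smallest pid.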
Reproduction
Any kernel measurement
Environment
attention-upgrade branch