feat: SGLang decode slow_down for PD disagg nsys profiling (with skip-warmup workflow)#60

Open
zhengd-nv wants to merge 10 commits into NVIDIA:main from zhengd-nv:slow-down
Conversation

@zhengd-nv zhengd-nv commented Apr 23, 2026

PR description

Summary

Wires SGLang’s /slow_down endpoint on decode worker leaders into the job YAML so that, in PD-disaggregated runs, the first decode forwards can be stretched in time while prefill catches up and decode batching builds. The goal is to align the nsys decode step window with a saturated decode phase. This workflow applies only to the SGLang frontend (sglang-router).

slow_down is designed to be used with SA-Bench warmup disabled (num_warmup_mult: 0). The built-in benchmark warmup is skipped so that decode step indices stay predictable. The usual “warmup” role is instead covered by a number of real decode forwards that run after slow_down auto-clears and before the nsys capture window. Those steps bring decode (e.g. CUDA graphs, batching) to a steady state before profiling.
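As a rough illustration of the "slow down, wait, then clear" sequence described above, the sketch below only builds the list of actions a benchmark client might take; it sends nothing. The helper name `plan_slow_down`, the `forward_sleep_time` payload field, and the mapping of `slow_down_sleep_time` / `slow_down_wait_time` onto it are assumptions made for illustration, not the code in this PR.

```python
from typing import List, Optional, Sequence, Tuple

# (action, target, payload); purely descriptive, nothing is sent over the network.
Step = Tuple[str, str, Optional[dict]]


def plan_slow_down(
    decode_leader_urls: Sequence[str],
    sleep_time: float,
    wait_time: float,
) -> List[Step]:
    """Build a hypothetical request plan for slowing down decode leaders.

    Assumption: each decode leader exposes POST /slow_down and accepts a
    JSON body with a `forward_sleep_time` field, where None clears it.
    """
    if sleep_time <= 0 or wait_time <= 0:
        raise ValueError("sleep_time and wait_time must be positive")

    endpoints = [f"{url.rstrip('/')}/slow_down" for url in decode_leader_urls]
    plan: List[Step] = []
    # 1. Ask every decode leader to sleep on each forward pass.
    for endpoint in endpoints:
        plan.append(("post", endpoint, {"forward_sleep_time": sleep_time}))
    # 2. Hold the slow-down while prefill catches up and decode batches build.
    plan.append(("wait", f"{wait_time}s", None))
    # 3. Clear the slow-down so decode runs at full speed again.
    for endpoint in endpoints:
        plan.append(("post", endpoint, {"forward_sleep_time": None}))
    return plan


steps = plan_slow_down(["http://decode-0:30000/"], sleep_time=2.0, wait_time=60.0)
for step in steps:
    print(step)
```

Keeping the plan as data makes the ordering easy to inspect or unit-test before any real HTTP client is attached; the host name and port in the usage line are placeholders.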

Mapping profiling.decode.start_step (example recipe)

For the example workload, decode nsys capture starts at a step chosen so that the window begins after three phases:

  1. Bootstrap: modeled here as osl decode steps (1024 when osl: 1024);
  2. slow_down window: a small number of forwards executed while /slow_down is active (e.g. 4 steps in the example, tied to your slow_down_* timing);
  3. Post-slow_down warmup: additional forwards after slow_down clears (e.g. 72 steps) so decode is “hot” before nsys starts capturing.

So in the example:

decode.start_step (1100) = bootstrap_steps (1024, = osl) + slow_down_steps (4) + warmup_steps (72).

Tweak the three terms if you change osl, concurrency, or slow_down_*, and set profiling.decode.start_step / stop_step accordingly.
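The step budget above is plain addition, so it can be recomputed whenever one of the three terms changes. The following is a minimal sketch; the class name `DecodeStepBudget` and the `capture_steps` argument are hypothetical helpers for illustration and do not exist in srtctl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeStepBudget:
    """Hypothetical helper mirroring the three terms in the example recipe."""

    bootstrap_steps: int  # modeled as osl in the example (1024)
    slow_down_steps: int  # forwards executed while /slow_down is active (4)
    warmup_steps: int     # real decode forwards after slow_down clears (72)

    @property
    def start_step(self) -> int:
        # Value to use for profiling.decode.start_step.
        return self.bootstrap_steps + self.slow_down_steps + self.warmup_steps

    def stop_step(self, capture_steps: int) -> int:
        # Assumption: the capture window is expressed as a number of steps
        # counted from start_step; adjust if you pick stop_step differently.
        if capture_steps <= 0:
            raise ValueError("capture_steps must be positive")
        return self.start_step + capture_steps


example = DecodeStepBudget(bootstrap_steps=1024, slow_down_steps=4, warmup_steps=72)
print(example.start_step)  # 1100, matching the example recipe
```

If osl changed to 2048 with the other terms unchanged, the same helper would give a start_step of 2124.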

User-facing changes

  • YAML: when benchmark.slow_down_sleep_time and slow_down_wait_time are both set and the frontend is SGLang, srtctl passes the decode leader URLs to SA-Bench; see benchmark_stage / bench.sh / benchmark_serving.py.
  • bench.sh: warmup is skipped when NUM_WARMUP_MULT is 0.
  • Example: recipes/.../1p1d-dep4-nsys-profile-slowdown.yaml documents the skip-warmup + slow_down + step budget for nsys in the file header and next to decode.start_step / num_warmup_mult.
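The gating rules in this list can be sketched as two small predicates. This is only an illustration of the conditions as stated here: the function names, the flat `benchmark` mapping, the `"sglang"` frontend string, and the default of 1 for a missing `num_warmup_mult` are all assumptions, not the actual srtctl or bench.sh logic.

```python
from typing import Any, Mapping


def slow_down_enabled(benchmark: Mapping[str, Any], frontend: str) -> bool:
    """True only if both slow_down_* keys are set and the frontend is SGLang."""
    both_set = (
        benchmark.get("slow_down_sleep_time") is not None
        and benchmark.get("slow_down_wait_time") is not None
    )
    # Assumption: the SGLang frontend is identified by the string "sglang".
    return both_set and frontend == "sglang"


def skip_warmup(benchmark: Mapping[str, Any]) -> bool:
    """True when warmup is disabled via num_warmup_mult: 0."""
    # Assumption: a missing key means warmup stays enabled (default of 1).
    return int(benchmark.get("num_warmup_mult", 1)) == 0


config = {
    "slow_down_sleep_time": 2.0,
    "slow_down_wait_time": 60.0,
    "num_warmup_mult": 0,
}
print(slow_down_enabled(config, "sglang"), skip_warmup(config))
```

Requiring both keys mirrors the "both set" wording above: setting only one of them leaves slow_down off rather than running with a half-specified timing.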

codecov-commenter commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 15.38462% with 22 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@698590e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/srtctl/cli/mixins/benchmark_stage.py | 8.33% | 22 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #60   +/-   ##
=======================================
  Coverage        ?   70.38%           
=======================================
  Files           ?       60           
  Lines           ?     6571           
  Branches        ?        0           
=======================================
  Hits            ?     4625           
  Misses          ?     1946           
  Partials        ?        0           
