feat: peak gen throughput metric in sa-bench + server-side node metrics CSV export by zhengd-nv · Pull Request #93 · NVIDIA/srt-slurm

zhengd-nv · 2026-04-27T07:01:59Z

Summary

This PR adds two complementary gen-throughput measurement capabilities:

Client-side: peak gen throughput in sa-bench (`benchmark_serving.py`)

Adds a peak_output_tokens_per_s metric to align with sglang bench_serving.py's peak gen throughput reporting.

Records start_time and per-chunk text_chunks on each RequestFuncOutput across all backends (OpenAI completions, chat completions, TRT-LLM, Dynamo).
After the benchmark completes, reconstructs each chunk's absolute arrival time as start_time + ttft + cumulative ITL. Because sa-bench measures ITL per SSE chunk (not per token), each chunk's text is tokenized to get an accurate token count.
Buckets token arrivals into 1-second windows, smooths the resulting series with a 10-sample moving average, and reports the peak as peak_output_tokens_per_s.
The metric is printed alongside output_throughput and included in the JSON result.
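The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in benchmark_serving.py: `ChunkedOutput` is a hypothetical stand-in for the fields recorded on RequestFuncOutput, per-chunk token counts are assumed to be precomputed (the PR tokenizes each chunk's text), and the handling of runs shorter than the smoothing window is an assumption.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ChunkedOutput:
    """Hypothetical stand-in for the per-request fields the PR records."""
    start_time: float                                   # absolute request start (s)
    ttft: float                                         # time to first chunk (s)
    itl: list[float] = field(default_factory=list)      # gaps between later chunks (s)
    chunk_tokens: list[int] = field(default_factory=list)  # tokens in each chunk


def chunk_arrivals(out):
    """Yield (absolute_arrival_time, token_count) for every chunk of one request."""
    t = out.start_time + out.ttft
    for i, n_tokens in enumerate(out.chunk_tokens):
        if i > 0:
            t += out.itl[i - 1]  # cumulative ITL after the first chunk
        yield t, n_tokens


def peak_output_tokens_per_s(outputs, window=10):
    """Peak of the moving-averaged tokens-per-second series."""
    arrivals = [a for out in outputs for a in chunk_arrivals(out)]
    if not arrivals:
        return 0.0

    # Bucket token arrivals into 1-second windows relative to the first arrival.
    t0 = min(t for t, _ in arrivals)
    buckets = Counter()
    for t, n_tokens in arrivals:
        buckets[int(t - t0)] += n_tokens
    series = [buckets.get(i, 0) for i in range(max(buckets) + 1)]

    # Moving average; assumption: shrink the window when the run is shorter.
    w = min(window, len(series))
    smoothed = [sum(series[i:i + w]) / w for i in range(len(series) - w + 1)]
    return max(smoothed)
```

With `window=1` the function returns the busiest single second, which makes the effect of the smoothing easy to compare on a given run.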

Server-side: per-node batch metrics CSV export (`analysis/srtlog`)

Adds analysis/srtlog/export_node_metrics.py to extract batch-level metrics from prefill/decode Slurm logs and write them to CSV. Server-side data captures both running_req and gen_throughput per batch step. This allows a more precise analysis of how gen throughput varies with concurrency, which the client-side metric cannot provide.

Writes one CSV per node, named {node}_{worker_type}_{worker_id}.csv; the columns cover all batch fields (token usage, queue depth, throughput, etc.).
A gen_throughput.csv summary groups by running_req and reports count/mean/median of gen_throughput.
Can be run standalone: python -m analysis.srtlog.export_node_metrics <run_path>
Integrated into the postprocess pipeline via benchmark.export_node_metrics: true in the job config; the export runs in an ephemeral venv to avoid polluting the container environment.
NodeAnalyzer.parse_run_logs() now also scans <run_path>/logs/ (matching the actual srt-slurm job output layout).
RunMetadata.format_date() handles additional timestamp formats (%Y-%m-%d %H:%M:%S[.%f]).
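To illustrate the gen_throughput.csv summary described above, here is a small sketch of grouping batch rows by running_req and computing count/mean/median. It uses only the standard library and is not the code in export_node_metrics.py; the function name `summarize_gen_throughput`, the output column names, and the choice to skip rows with an empty gen_throughput are assumptions.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_gen_throughput(rows):
    """Group rows by running_req; report count, mean and median of gen_throughput."""
    groups = defaultdict(list)
    for row in rows:
        running = row.get("running_req")
        throughput = row.get("gen_throughput")
        # Assumption: rows lacking either field (e.g. prefill batches) are skipped.
        if running in (None, "") or throughput in (None, ""):
            continue
        groups[int(running)].append(float(throughput))
    return [
        {
            "running_req": running,
            "count": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
        }
        for running, values in sorted(groups.items())
    ]


# Usage with made-up sample data in the shape of a per-node CSV.
sample = io.StringIO(
    "running_req,gen_throughput\n"
    "2,100\n"
    "2,200\n"
    "4,300\n"
    "4,\n"
)
summary = summarize_gen_throughput(csv.DictReader(sample))
```

Sorting by running_req keeps the summary readable as a concurrency-versus-throughput curve.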

Usage

Run sa-bench against a live endpoint; confirm that "Peak output token throughput (tok/s)" appears in the output and that its value is plausible relative to "Output token throughput". Verify that peak_output_tokens_per_s is present in the JSON result file.
Run python -m analysis.srtlog.export_node_metrics <run_path> on a completed job directory; confirm per-node CSVs and gen_throughput.csv are created under logs/node_metrics/.
Set benchmark.export_node_metrics: true in a job config and confirm CSVs are written automatically at the end of a sweep.
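The first check can also be scripted. The sketch below is a heuristic, not part of the PR: `check_peak_metric` is a hypothetical helper, it assumes the JSON result stores the average under the key output_throughput, and treating a peak below the average as suspicious is only a rough plausibility rule.

```python
import json


def check_peak_metric(result):
    """Return a list of human-readable problems found in a sa-bench result dict."""
    problems = []
    peak = result.get("peak_output_tokens_per_s")
    if peak is None:
        return ["peak_output_tokens_per_s is missing"]
    if peak <= 0:
        problems.append("peak_output_tokens_per_s is not positive")
    # Assumption: the average is stored as "output_throughput" in the same file.
    average = result.get("output_throughput")
    if average is not None and peak < average:
        problems.append("peak is below the average output throughput")
    return problems


# In practice the dict would come from json.load() on the result file;
# an inline JSON string with made-up values is used here instead.
result = json.loads('{"output_throughput": 950.0, "peak_output_tokens_per_s": 1200.0}')
problems = check_peak_metric(result)
```

An empty list means the metric is present and passes the rough plausibility rule.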

codecov-commenter · 2026-04-27T07:03:47Z

Codecov Report

❌ Patch coverage is 27.90698% with 31 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@1372a10). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/srtctl/cli/mixins/postprocess_stage.py	26.19%	31 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #93   +/-   ##
=======================================
  Coverage        ?   70.35%           
=======================================
  Files           ?       60           
  Lines           ?     6595           
  Branches        ?        0           
=======================================
  Hits            ?     4640           
  Misses          ?     1955           
  Partials        ?        0

ishandhanani · 2026-04-27T22:18:29Z

ping me on slack when ready to merge

zhengd-nv added 7 commits April 26, 2026 19:27

export node metrics

ea66b55

extract node metrics option

a8ef108

use uv for export script

eb50369

update comment

e994658

peak throughput

5a9a9aa

smooth

18821ec

fix log parse

2d5f6c3

ruff format

054db66

Merge remote-tracking branch 'origin/main' into node-metrics

cd77f45

zhengd-nv marked this pull request as ready for review April 29, 2026 07:40

zhengd-nv requested review from alec-flowers, csahithi, ishandhanani and nlevin-ui as code owners April 29, 2026 07:40

zhengd-nv wants to merge 9 commits into NVIDIA:main from zhengd-nv:node-metrics
