feat: live in-flight batch-metrics snapshotter (opt-in) by YAMY1234 · Pull Request #115 · NVIDIA/srt-slurm

YAMY1234 · 2026-04-29T18:40:11Z

Summary

Adds an opt-in, zero-dependency-on-production-path feature that spawns a daemon thread during the benchmark stage.
Every N seconds it re-parses prefill/decode worker logs and atomically overwrites <log_dir>/batch_metrics.png, giving a near-real-time view of the run without any external monitoring stack.

live_metrics lives under the telemetry block in srtslurm.yaml — it is a lightweight complementary signal (log polling, no scraper container required) that fits naturally alongside the existing DCGM/node_exporter telemetry.

Files changed

File	Purpose
`src/srtctl/analysis/__init__.py`	New `analysis` subpackage
`src/srtctl/analysis/batch_log_parser.py`	Incremental, stateful SGLang log parser — no plotting, no aggregation
`src/srtctl/analysis/live_metrics.py`	Background snapshotter thread + matplotlib renderer
`src/srtctl/core/schema.py`	New `LiveMetricsConfig` dataclass nested under `TelemetryConfig`
`src/srtctl/cli/mixins/benchmark_stage.py`	3-line hook to start/stop the snapshotter around the benchmark proc

How to enable

In srtslurm.yaml:

telemetry:
  live_metrics:
    enabled: true
    interval_seconds: 60   # default
    downsample: 1          # keep every sample

When disabled (default), zero code paths are touched — try_start_snapshotter returns None immediately.
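
A minimal sketch of the gating described above, assuming the cluster config arrives as a plain dict. The field names mirror the YAML keys and the 5-second floor comes from the commit message; the helper name `live_metrics_from_cluster_config` and the validation style are hypothetical, not the PR's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LiveMetricsConfig:
    """Illustrative stand-in; the real dataclass in schema.py may differ."""

    enabled: bool = False       # opt-in: disabled by default
    interval_seconds: int = 60  # default from the YAML above
    downsample: int = 1         # 1 keeps every sample

    def __post_init__(self) -> None:
        # Assumption: the "min 5" limit is enforced at construction time.
        if self.interval_seconds < 5:
            raise ValueError("interval_seconds must be >= 5")
        if self.downsample < 1:
            raise ValueError("downsample must be >= 1")


def live_metrics_from_cluster_config(cluster_config: dict) -> LiveMetricsConfig | None:
    """Return a config only when the feature is enabled, else None (the no-op path)."""
    telemetry = cluster_config.get("telemetry") or {}
    raw = telemetry.get("live_metrics") or {}
    if not raw.get("enabled", False):
        return None
    return LiveMetricsConfig(
        enabled=True,
        interval_seconds=int(raw.get("interval_seconds", 60)),
        downsample=int(raw.get("downsample", 1)),
    )
```

A caller would skip starting the thread whenever this helper returns None, which keeps the disabled path to a single dictionary lookup.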

Plot layout

7 rows × 2 columns, semantically paired (prefill left, decode right):

Row	Prefill	Decode
0	input throughput (token/s)	gen throughput (token/s)
1	#new-seq	#running-req
2	#new-token	#full token
3	#cached-token	full token usage
4	#prealloc-req	#prealloc-req
5	#queue-req	#queue-req
6	#inflight-req	#transfer-req

Worker labels are shortened to p0…pN / d0…dN. Colour map is tab20 (supports up to 20 workers).
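
The layout above can be written down as data. The sketch below copies the metric names from the table; the helper names (`panel_for`, `short_label`, `colour_index`, `downsample`) are hypothetical. Reusing colours for workers beyond the twentieth and reading `downsample` as a plain stride are assumptions, and the PR derives labels from the log filename suffix, which is not reproduced here.

```python
from __future__ import annotations

# (prefill metric, decode metric) per row, copied from the table above.
PLOT_ROWS = [
    ("input throughput (token/s)", "gen throughput (token/s)"),
    ("#new-seq", "#running-req"),
    ("#new-token", "#full token"),
    ("#cached-token", "full token usage"),
    ("#prealloc-req", "#prealloc-req"),
    ("#queue-req", "#queue-req"),
    ("#inflight-req", "#transfer-req"),
]


def panel_for(role: str, metric: str) -> tuple[int, int]:
    """Map a (role, metric) pair to its (row, column) in the 7 x 2 grid."""
    col = {"prefill": 0, "decode": 1}[role]
    for row, pair in enumerate(PLOT_ROWS):
        if pair[col] == metric:
            return row, col
    raise KeyError(f"{metric!r} is not plotted for {role}")


def short_label(role: str, index: int) -> str:
    """Build the p0..pN / d0..dN style tag from a role and worker index."""
    return f"{role[0]}{index}"


def colour_index(worker_index: int, palette_size: int = 20) -> int:
    """Pick a slot in a 20-colour palette such as tab20 (wraps past 20 workers)."""
    return worker_index % palette_size


def downsample(points: list, every_nth: int) -> list:
    """Keep every Nth point; every_nth=1 keeps all samples."""
    return points[::every_nth]
```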

Design notes

Parser uses a byte_offset per file — safe to call repeatedly on growing logs.
Snapshotter swallows its own exceptions so a plotting failure never kills the benchmark.
Atomic write: fig.savefig(tmp) → os.replace(tmp, dst) — readers never see a torn file.
matplotlib is an optional soft dependency; the snapshotter degrades gracefully if missing.
A "final tick" always runs on shutdown so the last state is captured even if the benchmark crashes early.
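
The exception swallowing, atomic write, and final tick listed above can be sketched with the standard library alone. This is not the PR's implementation: the `Snapshotter` class and its `render` callback (which returns the bytes to publish) are hypothetical, and a real renderer would call `fig.savefig` on the temporary path instead.

```python
import logging
import os
import tempfile
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def atomic_write_bytes(dst: str, payload: bytes) -> None:
    """Write to a temp file next to dst, then rename it over dst."""
    directory = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        os.replace(tmp, dst)  # atomic when tmp and dst share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


class Snapshotter:
    """Periodically renders a snapshot on a daemon thread."""

    def __init__(self, render: Callable[[], bytes], dst: str, interval_seconds: float) -> None:
        self._render = render
        self._dst = dst
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _tick(self) -> None:
        try:
            atomic_write_bytes(self._dst, self._render())
        except Exception:
            # Log and continue: a failed snapshot must not affect the benchmark.
            logger.exception("snapshot failed; continuing")

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to tick) and True once stopped.
        while not self._stop.wait(self._interval):
            self._tick()
        self._tick()  # final tick so the last state is captured on shutdown
```

Calling `stop()` from a `finally` block around the benchmark wait loop would trigger the final tick even when the benchmark exits early.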

Test plan

Smoke-tested locally against CoreWeave 491-tom-radixcache logs — plot generated correctly
End-to-end live smoke test on CoreWeave cluster pending
enabled: false (default) — confirm no behaviour change in existing benchmarks

Add an opt-in background daemon that re-parses prefill/decode worker logs every N seconds during the benchmark stage and overwrites `<log_dir>/batch_metrics.png` in place. Gives near-real-time visibility into running-req / queue-req / throughput / KV occupancy without any external monitoring stack: the orchestrator already shares the filesystem with worker logs, so we read them locally.

Behaviour change is gated by `srtslurm.yaml`:

reporting:
  live_metrics:
    enabled: true
    interval_seconds: 60   # default 60, min 5
    downsample: 1          # keep every Nth point when plotting

When disabled (default), this is a strict no-op on the benchmark path: `try_start_snapshotter()` returns None on the first config check.

Code layout keeps the orchestrator hook minimal:

src/srtctl/cli/mixins/benchmark_stage.py  +16 / -7
  one local import + one helper call + try/finally around the existing srun wait loop. No new mixin methods, no analysis-package internals leaking into the orchestrator.

src/srtctl/core/schema.py  +22
  new LiveMetricsConfig dataclass plugged into ReportingConfig.

src/srtctl/analysis/__init__.py  (new)

src/srtctl/analysis/batch_log_parser.py  (new)
  pure parser: per-file FileSeries with a byte_offset cache so successive ticks only read newly appended bytes. No aggregation, no plotting; reusable from any consumer.

src/srtctl/analysis/live_metrics.py  (new)
  snapshotter daemon, simple per-worker matplotlib renderer (one line per worker file per metric; no DP-rank scaling guesses, no cluster aggregation), and `try_start_snapshotter()` orchestrator-facing helper. All failures (matplotlib missing, malformed cluster config, render error) are logged and swallowed.

Aggregated cluster-wide views (sum-across-workers etc.) are intentionally left for a follow-up so they can be reviewed and tuned independently of the snapshotter mechanics.

Made-with: Cursor
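
The byte_offset cache mentioned in this commit message can be illustrated as follows. This is a sketch under assumptions: the real `FileSeries` presumably stores parsed metric samples, whereas this version only keeps raw lines. Holding back a trailing partial line until the next poll is also an assumption about how the parser copes with a log that is being written mid-tick.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FileSeries:
    """Illustrative per-file state; field names other than byte_offset are hypothetical."""

    path: str
    byte_offset: int = 0
    lines: list[str] = field(default_factory=list)

    def poll(self) -> int:
        """Read bytes appended since the last poll; return the number of new complete lines."""
        with open(self.path, "rb") as fh:
            fh.seek(self.byte_offset)
            chunk = fh.read()
        # Consume only up to the last newline so a half-written line is retried next tick.
        end = chunk.rfind(b"\n") + 1
        if end == 0:
            return 0
        new_lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
        self.lines.extend(new_lines)
        self.byte_offset += end
        return len(new_lines)
```

Because the offset only ever advances past complete lines, calling `poll()` repeatedly on a growing file neither re-reads old data nor splits a record across ticks.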

- batch_log_parser: label now emits "p0"/"d3" style short tags extracted from the log filename suffix instead of the full path
- live_metrics: replace flat per-metric columns with semantic (prefill, decode) row pairs driven by _PLOT_ROWS; drop prefill #running-req; ensure decode #queue-req/#transfer-req always appear; switch colour map to tab20 for 20-worker support

Made-with: Cursor

codecov-commenter · 2026-04-29T18:41:32Z

Codecov Report

❌ Patch coverage is 29.33333% with 212 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@1372a10). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/srtctl/analysis/live_metrics.py	23.28%	112 Missing ⚠️
src/srtctl/analysis/batch_log_parser.py	27.81%	96 Missing ⚠️
src/srtctl/cli/mixins/benchmark_stage.py	66.66%	4 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #115   +/-   ##
=======================================
  Coverage        ?   68.83%           
=======================================
  Files           ?       62           
  Lines           ?     6848           
  Branches        ?        0           
=======================================
  Hits            ?     4714           
  Misses          ?     2134           
  Partials        ?        0

ishandhanani · 2026-04-29T19:36:11Z

+        return None
+
+    try:
+        import matplotlib  # noqa: F401  -- imported here so we fail fast & loud


just have this be a default in srtslurm

live_metrics is a form of lightweight telemetry (log-polling + in-process PNG snapshotter), so it belongs under the telemetry block rather than reporting.

- Remove LiveMetricsConfig from ReportingConfig
- Add LiveMetricsConfig as an optional field on TelemetryConfig, with an updated docstring explaining the relationship
- Update try_start_snapshotter to read telemetry.live_metrics from the cluster config dict

Made-with: Cursor

… _validate_telemetry

- Add `telemetry: dict | None = None` to `ClusterConfig` so srtslurm.yaml can carry a `telemetry:` top-level block without failing marshmallow validation and causing model_paths/containers to silently disappear.
- Guard `_validate_telemetry()` against `telemetry is None` so recipes without a telemetry block do not crash with AttributeError.

The live-batch-metrics snapshotter requires matplotlib to render batch_metrics.png during benchmarks. Without it, srtctl silently skips the snapshotter. Making it a hard dependency ensures the feature works out of the box after pip install.

Made-with: Cursor

YAMY1234 added 2 commits April 29, 2026 10:07

ishandhanani reviewed Apr 29, 2026

YAMY1234 and others added 3 commits April 29, 2026 13:17

YAMY1234 wants to merge 5 commits into NVIDIA:main from YAMY1234:yangminl/live-batch-metrics-v2