feat: live in-flight batch-metrics snapshotter (opt-in) #115

Draft
YAMY1234 wants to merge 5 commits into NVIDIA:main from YAMY1234:yangminl/live-batch-metrics-v2
YAMY1234 (Collaborator) commented Apr 29, 2026

Summary

Adds an opt-in feature that adds nothing to the production path unless enabled. When enabled, it spawns a daemon thread during the benchmark stage.
Every N seconds the thread re-parses the prefill/decode worker logs and atomically overwrites <log_dir>/batch_metrics.png, giving a near-real-time view of the run without any external monitoring stack.

live_metrics lives under the telemetry block in srtslurm.yaml — it is a lightweight complementary signal (log polling, no scraper container required) that fits naturally alongside the existing DCGM/node_exporter telemetry.

Files changed

| File | Purpose |
| --- | --- |
| src/srtctl/analysis/__init__.py | New analysis subpackage |
| src/srtctl/analysis/batch_log_parser.py | Incremental, stateful SGLang log parser (no plotting, no aggregation) |
| src/srtctl/analysis/live_metrics.py | Background snapshotter thread + matplotlib renderer |
| src/srtctl/core/schema.py | New LiveMetricsConfig dataclass nested under TelemetryConfig |
| src/srtctl/cli/mixins/benchmark_stage.py | 3-line hook to start/stop the snapshotter around the benchmark proc |

How to enable

In srtslurm.yaml:

telemetry:
  live_metrics:
    enabled: true
    interval_seconds: 60   # default
    downsample: 1          # keep every sample

When disabled (default), zero code paths are touched: try_start_snapshotter returns None immediately.
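As a rough illustration of the config block above, the three keys could map onto a small dataclass like the one below. This is a sketch only: the field names and defaults come from the YAML, the floor of 5 seconds is taken from the "min 5" note in the commit message, and the validation behaviour (raising `ValueError`) and the `from_mapping` helper are assumptions, not the PR's actual `LiveMetricsConfig`.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Assumed floor, based on the "default 60, min 5" comment in the commit message.
MIN_INTERVAL_SECONDS = 5


@dataclass(frozen=True)
class LiveMetricsConfig:
    """Hypothetical mirror of the telemetry.live_metrics keys shown above."""

    enabled: bool = False       # feature is off unless explicitly enabled
    interval_seconds: int = 60  # seconds between snapshots
    downsample: int = 1         # keep every Nth point when plotting

    def __post_init__(self) -> None:
        # Rejecting bad values here is an assumption; the real schema may clamp instead.
        if self.interval_seconds < MIN_INTERVAL_SECONDS:
            raise ValueError(f"interval_seconds must be >= {MIN_INTERVAL_SECONDS}")
        if self.downsample < 1:
            raise ValueError("downsample must be >= 1")

    @classmethod
    def from_mapping(cls, raw: Optional[Mapping[str, Any]]) -> "LiveMetricsConfig":
        """Build a config from a parsed YAML mapping; a missing block means disabled."""
        raw = raw or {}
        return cls(
            enabled=bool(raw.get("enabled", False)),
            interval_seconds=int(raw.get("interval_seconds", 60)),
            downsample=int(raw.get("downsample", 1)),
        )
```

With this shape, `LiveMetricsConfig.from_mapping(None)` yields a disabled config, which matches the "disabled by default" behaviour described above.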

Plot layout

7 rows × 2 columns, semantically paired (prefill left, decode right):

| Row | Prefill | Decode |
| --- | --- | --- |
| 0 | input throughput (token/s) | gen throughput (token/s) |
| 1 | #new-seq | #running-req |
| 2 | #new-token | #full token |
| 3 | #cached-token | full token usage |
| 4 | #prealloc-req | #prealloc-req |
| 5 | #queue-req | #queue-req |
| 6 | #inflight-req | #transfer-req |

Worker labels are shortened to p0..pN / d0..dN. Colour map is tab20 (supports up to 20 workers).
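The layout above can be expressed as plain data, which keeps the renderer itself small. The sketch below is illustrative: `PLOT_ROWS` is modelled on the `_PLOT_ROWS` name mentioned in the commit message, while `panel_for` and `colour_index` are hypothetical helpers, and wrapping the colour index at 20 is an assumption about how a 21st worker would be handled.

```python
from typing import List, Tuple

# One (prefill metric, decode metric) pair per subplot row, as in the table above.
PLOT_ROWS: List[Tuple[str, str]] = [
    ("input throughput (token/s)", "gen throughput (token/s)"),
    ("#new-seq", "#running-req"),
    ("#new-token", "#full token"),
    ("#cached-token", "full token usage"),
    ("#prealloc-req", "#prealloc-req"),
    ("#queue-req", "#queue-req"),
    ("#inflight-req", "#transfer-req"),
]

TAB20_SIZE = 20  # tab20 provides 20 distinct colours


def panel_for(role: str, metric: str) -> Tuple[int, int]:
    """Return the (row, column) of the subplot that shows `metric` for `role`."""
    col = {"prefill": 0, "decode": 1}[role]
    for row, pair in enumerate(PLOT_ROWS):
        if pair[col] == metric:
            return row, col
    raise KeyError(f"{metric!r} is not plotted for {role}")


def colour_index(worker_label: str) -> int:
    """Map a short label such as 'p0' or 'd3' to a tab20 colour slot."""
    # Assumes the label is one role letter followed by the worker number.
    return int(worker_label[1:]) % TAB20_SIZE
```

Because several metric names appear in both columns, the lookup needs the role as well as the metric name.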

Design notes

  • Parser uses a byte_offset per file — safe to call repeatedly on growing logs.
  • Snapshotter swallows its own exceptions so a plotting failure never kills the benchmark.
  • Atomic write: fig.savefig(tmp) → os.replace(tmp, dst) — readers never see a torn file.
  • matplotlib is an optional soft dependency; the snapshotter degrades gracefully if missing.
  • A "final tick" always runs on shutdown so the last state is captured even if the benchmark crashes early.
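A minimal sketch of how the exception swallowing, the final tick, and the atomic write could fit together, using only the standard library. `Snapshotter` and `atomic_write_bytes` are hypothetical names, not the PR's classes, and the write helper takes raw bytes where the PR calls `fig.savefig(tmp)` so that the example runs without matplotlib.

```python
import logging
import os
import tempfile
import threading
from typing import Callable, Optional

log = logging.getLogger(__name__)


def atomic_write_bytes(dst: str, payload: bytes) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp, dst)  # readers see the old file or the new one, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


class Snapshotter:
    """Calls `render` every `interval_seconds`, plus once more on shutdown."""

    def __init__(self, render: Callable[[], None], interval_seconds: float) -> None:
        self._render = render
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="live-metrics", daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout: Optional[float] = None) -> None:
        self._stop.set()
        self._thread.join(timeout)

    def _tick(self) -> None:
        try:
            self._render()
        except Exception:
            # Swallow and log so a rendering failure cannot affect the benchmark.
            log.exception("live-metrics render failed")

    def _run(self) -> None:
        # Event.wait returns False on timeout (time to tick) and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._tick()
        self._tick()  # final tick so the last state is captured
```

Using `Event.wait` as the sleep means `stop()` wakes the thread immediately instead of waiting out the rest of the interval.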

Test plan

  • Smoke-tested locally against CoreWeave 491-tom-radixcache logs — plot generated correctly
  • End-to-end live smoke test on CoreWeave cluster pending
  • enabled: false (default) — confirm no behaviour change in existing benchmarks

Add an opt-in background daemon that re-parses prefill/decode worker
logs every N seconds during the benchmark stage and overwrites
`<log_dir>/batch_metrics.png` in place. Gives near-real-time visibility
into running-req / queue-req / throughput / KV occupancy without any
external monitoring stack — the orchestrator already shares the
filesystem with worker logs, so we read them locally.

Behaviour change is gated by `srtslurm.yaml`:

    reporting:
      live_metrics:
        enabled: true
        interval_seconds: 60   # default 60, min 5
        downsample: 1          # keep every Nth point when plotting

When disabled (default), this is a strict no-op on the benchmark path:
`try_start_snapshotter()` returns None on the first config check.

Code layout keeps the orchestrator hook minimal:

  src/srtctl/cli/mixins/benchmark_stage.py  +16 / -7
      one local import + one helper call + try/finally around the
      existing srun wait loop. No new mixin methods, no analysis-package
      internals leaking into the orchestrator.

  src/srtctl/core/schema.py                 +22
      new LiveMetricsConfig dataclass plugged into ReportingConfig.

  src/srtctl/analysis/__init__.py            (new)
  src/srtctl/analysis/batch_log_parser.py    (new)
      pure parser: per-file FileSeries with a byte_offset cache so
      successive ticks only read newly appended bytes. No aggregation,
      no plotting; reusable from any consumer.

  src/srtctl/analysis/live_metrics.py        (new)
      snapshotter daemon, simple per-worker matplotlib renderer (one
      line per worker file per metric — no DP-rank scaling guesses, no
      cluster aggregation), and `try_start_snapshotter()` orchestrator-
      facing helper. All failures (matplotlib missing, malformed
      cluster config, render error) are logged and swallowed.
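The byte_offset cache described for the parser can be sketched as follows. This is an assumption-laden illustration: only the `FileSeries` name and the `byte_offset` idea come from the description above, the `poll` method and `lines` field are hypothetical, and the example stores raw lines rather than guessing at the SGLang log format. Holding back a trailing partial line until its newline arrives is also an assumption about how the real parser treats a log that is mid-write.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSeries:
    """Per-file state: where we stopped reading and what we have collected."""

    path: str
    byte_offset: int = 0
    lines: List[str] = field(default_factory=list)

    def poll(self) -> int:
        """Read only bytes appended since the last poll; return the count of new lines."""
        with open(self.path, "rb") as handle:
            handle.seek(self.byte_offset)
            chunk = handle.read()

        # Consume up to the last newline; an unfinished line is re-read next time.
        end = chunk.rfind(b"\n") + 1
        if end == 0:
            return 0

        self.byte_offset += end
        new_lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
        self.lines.extend(new_lines)
        return len(new_lines)
```

Calling `poll()` repeatedly on a growing log is cheap because each call starts at the stored offset rather than at the beginning of the file.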

Aggregated cluster-wide views (sum-across-workers etc.) are
intentionally left for a follow-up so they can be reviewed and tuned
independently of the snapshotter mechanics.

Made-with: Cursor

- batch_log_parser: label now emits "p0"/"d3" style short tags
  extracted from the log filename suffix instead of the full path
- live_metrics: replace flat per-metric columns with semantic
  (prefill, decode) row pairs driven by _PLOT_ROWS; drop
  prefill #running-req; ensure decode #queue-req/#transfer-req
  always appear; switch colour map to tab20 for 20-worker support

Made-with: Cursor

codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 29.33333% with 212 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@1372a10). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/srtctl/analysis/live_metrics.py | 23.28% | 112 Missing ⚠️ |
| src/srtctl/analysis/batch_log_parser.py | 27.81% | 96 Missing ⚠️ |
| src/srtctl/cli/mixins/benchmark_stage.py | 66.66% | 4 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #115   +/-   ##
=======================================
  Coverage        ?   68.83%           
=======================================
  Files           ?       62           
  Lines           ?     6848           
  Branches        ?        0           
=======================================
  Hits            ?     4714           
  Misses          ?     2134           
  Partials        ?        0           


    return None

try:
    import matplotlib  # noqa: F401 -- imported here so we fail fast & loud
Collaborator


just have this be a default in srtslurm

YAMY1234 and others added 3 commits April 29, 2026 13:17
live_metrics is a form of lightweight telemetry (log-polling +
in-process PNG snapshotter), so it belongs under the telemetry
block rather than reporting.

- Remove LiveMetricsConfig from ReportingConfig
- Add LiveMetricsConfig as an optional field on TelemetryConfig,
  with an updated docstring explaining the relationship
- Update try_start_snapshotter to read telemetry.live_metrics
  from the cluster config dict
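For illustration, the lookup of telemetry.live_metrics in the cluster config dict might look like the sketch below. `live_metrics_settings` is a hypothetical helper, not the PR's `try_start_snapshotter`; treating a missing or malformed block as "disabled" is an assumption that follows the stated goal of never failing the benchmark.

```python
from typing import Any, Mapping, Optional


def live_metrics_settings(cluster_config: Any) -> Optional[Mapping[str, Any]]:
    """Return the telemetry.live_metrics mapping if enabled, otherwise None."""
    if not isinstance(cluster_config, Mapping):
        return None  # no cluster config at all

    telemetry = cluster_config.get("telemetry")
    if not isinstance(telemetry, Mapping):
        return None  # telemetry block absent or malformed

    live = telemetry.get("live_metrics")
    if not isinstance(live, Mapping) or not live.get("enabled", False):
        return None  # block absent, malformed, or explicitly disabled

    return live
```

A caller would start the snapshotter only when this returns a mapping, so recipes without a telemetry block take the early-return path.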

Made-with: Cursor
… _validate_telemetry

- Add `telemetry: dict | None = None` to `ClusterConfig` so srtslurm.yaml
  can carry a `telemetry:` top-level block without failing marshmallow
  validation and causing model_paths/containers to silently disappear.
- Guard `_validate_telemetry()` against `telemetry is None` so recipes
  without a telemetry block do not crash with AttributeError.

The live-batch-metrics snapshotter requires matplotlib to render
batch_metrics.png during benchmarks. Without it, srtctl silently skips
the snapshotter. Making it a hard dependency ensures the feature works
out of the box after pip install.

Made-with: Cursor