feat: live in-flight batch-metrics snapshotter (opt-in)#115
Draft
YAMY1234 wants to merge 5 commits intoNVIDIA:mainfrom
Draft
feat: live in-flight batch-metrics snapshotter (opt-in)#115YAMY1234 wants to merge 5 commits intoNVIDIA:mainfrom
YAMY1234 wants to merge 5 commits intoNVIDIA:mainfrom
Conversation
Add an opt-in background daemon that re-parses prefill/decode worker
logs every N seconds during the benchmark stage and overwrites
`<log_dir>/batch_metrics.png` in place. Gives near-real-time visibility
into running-req / queue-req / throughput / KV occupancy without any
external monitoring stack — the orchestrator already shares the
filesystem with worker logs, so we read them locally.
Behaviour change is gated by `srtslurm.yaml`:
reporting:
live_metrics:
enabled: true
interval_seconds: 60 # default 60, min 5
downsample: 1 # keep every Nth point when plotting
When disabled (default), this is a strict no-op on the benchmark path:
`try_start_snapshotter()` returns None on the first config check.
Code layout keeps the orchestrator hook minimal:
src/srtctl/cli/mixins/benchmark_stage.py +16 / -7
one local import + one helper call + try/finally around the
existing srun wait loop. No new mixin methods, no analysis-package
internals leaking into the orchestrator.
src/srtctl/core/schema.py +22
new LiveMetricsConfig dataclass plugged into ReportingConfig.
src/srtctl/analysis/__init__.py (new)
src/srtctl/analysis/batch_log_parser.py (new)
pure parser: per-file FileSeries with a byte_offset cache so
successive ticks only read newly appended bytes. No aggregation,
no plotting; reusable from any consumer.
src/srtctl/analysis/live_metrics.py (new)
snapshotter daemon, simple per-worker matplotlib renderer (one
line per worker file per metric — no DP-rank scaling guesses, no
cluster aggregation), and `try_start_snapshotter()` orchestrator-
facing helper. All failures (matplotlib missing, malformed
cluster config, render error) are logged and swallowed.
Aggregated cluster-wide views (sum-across-workers etc.) are
intentionally left for a follow-up so they can be reviewed and tuned
independently of the snapshotter mechanics.
Made-with: Cursor
- batch_log_parser: label now emits "p0"/"d3" style short tags extracted from the log filename suffix instead of the full path - live_metrics: replace flat per-metric columns with semantic (prefill, decode) row pairs driven by _PLOT_ROWS; drop prefill #running-req; ensure decode #queue-req/#transfer-req always appear; switch colour map to tab20 for 20-worker support Made-with: Cursor
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #115 +/- ##
=======================================
Coverage ? 68.83%
=======================================
Files ? 62
Lines ? 6848
Branches ? 0
=======================================
Hits ? 4714
Misses ? 2134
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| return None | ||
|
|
||
| try: | ||
| import matplotlib # noqa: F401 -- imported here so we fail fast & loud |
Collaborator
There was a problem hiding this comment.
just have this be a default in srtslurm
live_metrics is a form of lightweight telemetry (log-polling + in-process PNG snapshotter), so it belongs under the telemetry block rather than reporting. - Remove LiveMetricsConfig from ReportingConfig - Add LiveMetricsConfig as an optional field on TelemetryConfig, with an updated docstring explaining the relationship - Update try_start_snapshotter to read telemetry.live_metrics from the cluster config dict Made-with: Cursor
… _validate_telemetry - Add `telemetry: dict | None = None` to `ClusterConfig` so srtslurm.yaml can carry a `telemetry:` top-level block without failing marshmallow validation and causing model_paths/containers to silently disappear. - Guard `_validate_telemetry()` against `telemetry is None` so recipes without a telemetry block do not crash with AttributeError.
The live-batch-metrics snapshotter requires matplotlib to render batch_metrics.png during benchmarks. Without it, srtctl silently skips the snapshotter. Making it a hard dependency ensures the feature works out of the box after pip install. Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an opt-in, zero-dependency-on-production-path feature that spawns a daemon thread during the benchmark stage.
Every N seconds it re-parses prefill/decode worker logs and atomically overwrites
<log_dir>/batch_metrics.png, giving a near-real-time view of the run without any external monitoring stack.live_metricslives under thetelemetryblock insrtslurm.yaml— it is a lightweight complementary signal (log polling, no scraper container required) that fits naturally alongside the existing DCGM/node_exporter telemetry.Files changed
src/srtctl/analysis/__init__.pyanalysissubpackagesrc/srtctl/analysis/batch_log_parser.pysrc/srtctl/analysis/live_metrics.pysrc/srtctl/core/schema.pyLiveMetricsConfigdataclass nested underTelemetryConfigsrc/srtctl/cli/mixins/benchmark_stage.pyHow to enable
In
srtslurm.yaml:When disabled (default), zero code paths are touched —
try_start_snapshotterreturnsNoneimmediately.Plot layout
7 rows × 2 columns, semantically paired (prefill left, decode right):
Worker labels are shortened to
p0…pN/d0…dN. Colour map istab20(supports up to 20 workers).Design notes
byte_offsetper file — safe to call repeatedly on growing logs.fig.savefig(tmp) → os.replace(tmp, dst)— readers never see a torn file.matplotlibis an optional soft dependency; the snapshotter degrades gracefully if missing.Test plan
491-tom-radixcachelogs — plot generated correctlyenabled: false(default) — confirm no behaviour change in existing benchmarks