Built-in GPU performance monitoring during benchmarks #35

Open
KaunilD wants to merge 3 commits into NVIDIA:main from KaunilD:kdhruv/gweperf_integration

KaunilD commented Apr 13, 2026

Summary

  • Add lightweight GPU performance monitor (perfmon.py) that polls nvidia-smi during benchmarks, writing per-node CSV time-series (util, memory, power, temp) and aggregate JSON summaries
  • Monitor all worker nodes including the head node (head runs GPU workers in most topologies)
  • Auto-resolve HF hub cache snapshot directories in sa-bench and mooncake-router benchmark scripts, fixing tokenizer loading when model.path points to a hub cache root with symlinked snapshots
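The snapshot resolution in the last bullet could look roughly like the sketch below. It is not the code from this PR: the function name `resolve_snapshot_dir` is hypothetical, and it assumes the standard Hugging Face hub cache layout, where a model directory holds `refs/main` (a file containing a commit hash) and `snapshots/<hash>/` (files symlinked into `blobs/`).

```python
from pathlib import Path


def resolve_snapshot_dir(model_path: str) -> Path:
    """Return a directory a tokenizer can load from (hypothetical helper).

    If model_path is a hub cache root (it contains a snapshots/ directory),
    return the snapshot that refs/main points to, falling back to the most
    recently modified snapshot. Otherwise return model_path unchanged.
    """
    root = Path(model_path)
    snapshots = root / "snapshots"
    if not snapshots.is_dir():
        # Already a plain model directory (or a snapshot directory itself).
        return root

    ref = root / "refs" / "main"
    if ref.is_file():
        candidate = snapshots / ref.read_text().strip()
        if candidate.is_dir():
            return candidate

    # No usable ref: pick the newest snapshot directory, if there is one.
    candidates = sorted(
        (p for p in snapshots.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else root
```

Passing the resolved directory to the tokenizer loader avoids pointing it at the cache root, which holds no `tokenizer.json` or `config.json` of its own.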

Test plan

  • tests/test_monitoring.py — 19 tests passing
  • E2E: verify perf_samples_{node}.csv appears for all nodes including head
  • Verify tokenizer resolution with HF hub cache model directories

KaunilD and others added 3 commits April 7, 2026 15:42
…tion

Starts one gweperf process per worker node before the benchmark and stops
it cleanly (SIGINT + 30s timeout fallback to SIGKILL) after. Adds
MonitoringConfig/MonitoringFeaturesConfig schema and gweperf_path cluster
setting. Monitoring failures are non-fatal; benchmark continues unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
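The stop sequence described in this commit (SIGINT, then SIGKILL after a 30s timeout) can be sketched as follows. This is an illustration only: `stop_monitor` is a hypothetical name, and it assumes the monitor is a local `subprocess.Popen` handle, whereas the PR launches one process per worker node and may use a different mechanism.

```python
import signal
import subprocess


def stop_monitor(proc: subprocess.Popen, timeout: float = 30.0) -> int:
    """Ask a monitor process to exit via SIGINT; SIGKILL it if it does not.

    Hypothetical helper; returns the process's exit code.
    """
    if proc.poll() is None:
        # SIGINT gives the monitor a chance to flush its output files.
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # The process ignored SIGINT or hung: force termination.
            proc.kill()
            proc.wait()
    return proc.returncode
```

To keep monitoring failures non-fatal, as the commit message requires, a caller would wrap this in a `try`/`except` that logs the error and lets the benchmark continue.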
Removes the external gweperf dependency and srtslurm.yaml gweperf_path
requirement. Adds src/srtctl/monitor/perfmon.py — a ~90-line script that
polls nvidia-smi and writes per-second CSV samples + aggregate JSON on
SIGINT. Drops MonitoringFeaturesConfig; MonitoringConfig is now just
enabled + sample_interval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
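A minimal sketch of the polling approach this commit describes is shown below. It is not the actual `perfmon.py`: the function names, CSV columns, and summary shape (mean and max per metric) are assumptions. The `nvidia-smi` flags `--query-gpu` and `--format=csv,noheader,nounits` are standard, but the exact field list used in the PR is not shown in the source.

```python
import csv
import json
import signal
import subprocess
import time
from statistics import mean

# Assumed metric set: utilization, memory, power, temperature.
FIELDS = ["index", "utilization.gpu", "memory.used", "power.draw", "temperature.gpu"]
QUERY = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"]


def parse_sample(output: str, timestamp: float) -> list:
    """Turn one nvidia-smi CSV reply into a list of per-GPU row dicts."""
    rows = []
    for line in output.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip malformed lines
        row = {"timestamp": timestamp, "gpu": int(values[0])}
        for name, value in zip(FIELDS[1:], values[1:]):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # unsupported metrics are reported as "[N/A]"
        rows.append(row)
    return rows


def summarize(rows: list) -> dict:
    """Aggregate mean and max per metric across all samples and GPUs."""
    summary = {}
    for name in FIELDS[1:]:
        values = [r[name] for r in rows if r[name] is not None]
        if values:
            summary[name] = {"mean": mean(values), "max": max(values)}
    return summary


def run(csv_path: str, json_path: str, interval: float = 1.0) -> None:
    """Poll until SIGINT, streaming samples to CSV, then write the JSON summary."""
    stop = False

    def on_sigint(signum, frame):
        nonlocal stop
        stop = True

    signal.signal(signal.SIGINT, on_sigint)
    rows = []
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "gpu", *FIELDS[1:]])
        writer.writeheader()
        while not stop:
            now = time.time()
            reply = subprocess.run(QUERY, capture_output=True, text=True, check=False)
            sample = parse_sample(reply.stdout, now)
            writer.writerows(sample)
            f.flush()  # keep the file usable if the process is killed later
            rows.extend(sample)
            time.sleep(interval)
    with open(json_path, "w") as f:
        json.dump(summarize(rows), f, indent=2)
```

Flushing after every sample means the CSV survives a SIGKILL fallback; only the JSON summary depends on a clean SIGINT shutdown. `run` raises `FileNotFoundError` if `nvidia-smi` is not on the PATH, so a caller that wants non-fatal monitoring would need to catch that.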
The head node runs GPU workers in most topologies, so excluding it
from perfmon left a gap in power/utilization metrics. Remove the
head-node filter so all worker nodes get monitored.

Made-with: Cursor
@KaunilD KaunilD force-pushed the kdhruv/gweperf_integration branch from e36a925 to cf68041 Compare April 13, 2026 20:49
@codecov-commenter

Codecov Report

❌ Patch coverage is 59.57447% with 19 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@896eabe).

Files with missing lines | Patch % | Lines
src/srtctl/cli/mixins/benchmark_stage.py | 54.76% | 19 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #35   +/-   ##
=======================================
  Coverage        ?   62.39%           
=======================================
  Files           ?       52           
  Lines           ?     5053           
  Branches        ?        0           
=======================================
  Hits            ?     3153           
  Misses          ?     1900           
  Partials        ?        0           

KaunilD commented Apr 14, 2026

[Image: Screenshot 2026-04-14 at 2:39:26 PM]

Uploading graphs created with the collected data, aligned using the timestamp from benchmark.out.
