[codex] Add generic telemetry and custom benchmark support#43

Merged
ishandhanani merged 3 commits into main from codex/telemetry-custom-benchmark on Apr 18, 2026

Conversation

ishandhanani (Collaborator) commented on Apr 17, 2026

What changed

This PR adds the srt-slurm pieces needed for benchmarking workflows without making srt-slurm itself user-facing.

  • added a first-class custom benchmark runner with optional benchmark-specific container and environment support
  • added a generic telemetry model and orchestration stage that launches exporter processes plus a scraper-compatible collector
  • kept the telemetry surface generic rather than coupled to any single provider
  • hardened postprocess so raw logs still sync to S3 even when parsing fails
  • extended dry-run output to show custom benchmark and telemetry configuration
  • added focused test coverage for benchmark registration, telemetry config generation/startup, dry-run rendering, and resilient postprocess behavior

Why

  • there was no first-class custom benchmark path
  • there was no first-class telemetry concept in the orchestrator
  • postprocess could skip the S3 upload when parsing failed, which made failure analysis brittle

Impact

  • benchmark authors can supply arbitrary benchmark commands without patching the orchestrator
  • telemetry can be enabled as a normal top-level config with provider-compatible structure
  • failed jobs retain raw logs and postprocess status in S3 more reliably
  • dry-run now surfaces the extra execution config users need to verify before launch

Root cause for the durability fix

The postprocess container used `set -e` and ran parsing before `aws s3 sync`, so a parser failure prevented raw artifacts from being uploaded at all. This PR makes parsing best-effort and keeps the upload as the priority path.
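The ordering fix can be sketched as follows. This is a minimal illustration, not the actual container script: the command lists are parameters so the parse/sync steps are stand-ins (in the real job, the sync command would be something like `aws s3 sync <log_dir> <s3_uri>`).

```python
import subprocess

def postprocess(parse_cmd: list[str], sync_cmd: list[str]) -> str:
    """Run parsing best-effort, then always run the sync step.

    Under `set -e`, a parser failure before `aws s3 sync` aborts the
    whole script and no raw artifacts reach S3. Here the parse failure
    is recorded instead of raised, so the upload always runs.
    """
    parse_status = "ok"
    try:
        subprocess.run(parse_cmd, check=True)
    except subprocess.CalledProcessError:
        parse_status = "failed"  # record the failure, don't abort

    # Upload is the priority path: it runs regardless of parsing outcome.
    subprocess.run(sync_cmd, check=True)
    return parse_status
```

With this shape, a broken parser downgrades to a recorded status while the raw logs still sync, which is exactly the durability property the PR targets.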

ishandhanani marked this pull request as ready for review on April 17, 2026 at 07:26
- srtctl apply --json: emit one JSON line per submission on stdout (slurm_job_id,
  job_name, output_dir, metadata_path, config_path, tags). Prose goes to stderr.
  Errors emit {"status":"error","error":...} and exit non-zero. Module-level
  console is restored on exit so direct library callers of submit_* don't see
  a leaked stderr binding.
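The one-object-per-line contract above can be sketched like this (field values are illustrative and the real emitter lives inside srtctl; the point is the stdout/stderr split):

```python
import json
import sys

def submission_line(record: dict) -> str:
    """Render one submission as a single compact JSON line for stdout."""
    return json.dumps(record, separators=(",", ":"))

# Machine-readable JSON goes to stdout; prose goes to stderr, so
# pipelines like `srtctl apply --json | jq .slurm_job_id` see clean input.
print("submitting job ...", file=sys.stderr)
print(submission_line({"slurm_job_id": "12345", "job_name": "demo"}))
```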

- srtctl/mock.py: MockInfra context manager swapping 16 external surfaces
  (start_srun_process in 8 modules, hostname/IP resolution in 4, port and
  model health in 3, status HTTP in 1) with local fakes. FakePopen drop-in
  mimics subprocess.Popen. run_mock_sweep executes the full
  SweepOrchestrator against those fakes and writes realistic artifacts
  (status.json, status_events.jsonl, result.json, recipe.lock.yaml, logs).
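A drop-in Popen fake of the kind described above might look like this; the class is a hypothetical sketch covering the surface orchestration code typically touches (`poll`, `wait`, `terminate`, `communicate`), not srtctl's actual `FakePopen`:

```python
import io

class FakePopen:
    """Minimal stand-in mimicking the subprocess.Popen surface."""

    def __init__(self, args, **kwargs):
        self.args = args
        self.returncode = None          # None while "running"
        self.stdout = io.BytesIO(b"")
        self.stderr = io.BytesIO(b"")

    def poll(self):
        return self.returncode

    def wait(self, timeout=None):
        self.returncode = 0             # pretend the process exited cleanly
        return self.returncode

    def terminate(self):
        self.returncode = -15           # mirror SIGTERM semantics

    def communicate(self, input=None, timeout=None):
        self.wait()
        return self.stdout.read(), self.stderr.read()
```

Because the fake keeps Popen's attribute and method shapes, code that only inspects `returncode` or calls `poll`/`wait` runs unchanged against it.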

- srtctl.cli.mock_worker: CLI entry (`python -m srtctl.cli.mock_worker ...`)
  that wraps run_mock_sweep for spawning as a subprocess.

- srtctl apply --mock [--mock-tick-s T]: stubs sbatch inside the real submit
  flow (submit_with_orchestrator still runs for real — config load, metadata
  write, JSON submission emission), then detaches a mock_worker subprocess
  that drives the full SweepOrchestrator against the output_dir.
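The detach step can be sketched as a session-separated subprocess launch; the helper name and flags below are illustrative, not srtctl's actual code:

```python
import subprocess
import sys

def detach_worker(cmd: list[str]) -> int:
    """Launch a worker command detached from the caller's session so the
    submitting process can exit while the worker keeps driving the sweep."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # new session: not killed with the parent
    )
    return proc.pid

# Hypothetical invocation mirroring the flow described above:
# detach_worker([sys.executable, "-m", "srtctl.cli.mock_worker",
#                "--output-dir", out_dir, "--mock-tick-s", "0.5"])
```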

- Tests: test_apply_json (3), test_apply_mock (2), test_mock_sweep (3).

- CI: new `mock-and-server` job explicitly runs the new test files plus
  test_integration_status, and smoke-tests both `srtctl apply --mock --json`
  and `python -m srtctl.cli.mock_worker` as real subprocesses.
ishandhanani force-pushed the codex/telemetry-custom-benchmark branch from af623f5 to c9856d7 on April 18, 2026 at 02:05
On GitHub Actions runners RUNNER_NAME is auto-set, which causes
get_job_name() to return the runner identity instead of the configured
job name, tripping the job_name assertion.
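A minimal illustration of the precedence problem (the function signature is hypothetical; the real `get_job_name` lives elsewhere in the codebase):

```python
import os

def get_job_name(configured=None):
    """If the environment-derived name wins unconditionally, any machine
    that pre-sets RUNNER_NAME (as GitHub Actions runners always do)
    shadows the configured job name and trips the job_name assertion.
    Letting the configured name take precedence avoids that.
    """
    return configured or os.environ.get("RUNNER_NAME", "unknown")
```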
ishandhanani merged commit f02b633 into main on Apr 18, 2026
6 checks passed