[codex] Add generic telemetry and custom benchmark support#43

Merged
ishandhanani merged 3 commits into main from codex/telemetry-custom-benchmark on Apr 18, 2026

Conversation

ishandhanani (Collaborator) commented on Apr 17, 2026

What changed

This PR adds the srt-slurm pieces needed for benchmarking workflows without making srt-slurm itself user-facing.

  • added a first-class custom benchmark runner with optional benchmark-specific container and environment support
  • added a generic telemetry model and orchestration stage that launches exporter processes plus a scraper-compatible collector
  • kept the telemetry surface generic rather than coupled to any single provider
  • hardened postprocess so raw logs still sync to S3 even when parsing fails
  • extended dry-run output to show custom benchmark and telemetry configuration
  • added focused test coverage for benchmark registration, telemetry config generation/startup, dry-run rendering, and resilient postprocess behavior

Why

  • there was no first-class custom benchmark path
  • there was no first-class telemetry concept in the orchestrator
  • postprocess could skip the S3 upload when parsing failed, which made failure analysis brittle

Impact

  • benchmark authors can supply arbitrary benchmark commands without patching the orchestrator
  • telemetry can be enabled as a normal top-level config with provider-compatible structure
  • failed jobs retain raw logs and postprocess status in S3 more reliably
  • dry-run now surfaces the extra execution config users need to verify before launch

Root cause for the durability fix

The postprocess container used `set -e` and ran parsing before `aws s3 sync`, so a parser failure prevented raw artifacts from being uploaded at all. This PR makes parsing best-effort and keeps the upload as the priority path.
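The ordering fix can be sketched as follows. This is a minimal illustration, not the actual container script: the command lists are parameters so the parse/sync steps are stand-ins (in the real job, the sync command would be something like `aws s3 sync <log_dir> <s3_uri>`).

```python
import subprocess

def postprocess(parse_cmd: list[str], sync_cmd: list[str]) -> str:
    """Run parsing best-effort, then always run the sync step.

    Under `set -e`, a parser failure before `aws s3 sync` aborts the
    whole script and no raw artifacts reach S3. Here the parse failure
    is recorded instead of raised, so the upload always runs.
    """
    parse_status = "ok"
    try:
        subprocess.run(parse_cmd, check=True)
    except subprocess.CalledProcessError:
        parse_status = "failed"  # record the failure, don't abort

    # Upload is the priority path: it runs regardless of parsing outcome.
    subprocess.run(sync_cmd, check=True)
    return parse_status
```

With this shape, a broken parser downgrades to a recorded status while the raw logs still sync, which is exactly the durability property the PR targets.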

ishandhanani marked this pull request as ready for review on April 17, 2026 at 07:26
- srtctl apply --json: emit one JSON line per submission on stdout (slurm_job_id,
  job_name, output_dir, metadata_path, config_path, tags). Prose goes to stderr.
  Errors emit {"status":"error","error":...} and exit non-zero. Module-level
  console is restored on exit so direct library callers of submit_* don't see
  a leaked stderr binding.
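The one-object-per-line contract above can be sketched like this (field values are illustrative and the real emitter lives inside srtctl; the point is the stdout/stderr split):

```python
import json
import sys

def submission_line(record: dict) -> str:
    """Render one submission as a single compact JSON line for stdout."""
    return json.dumps(record, separators=(",", ":"))

# Machine-readable JSON goes to stdout; prose goes to stderr, so
# pipelines like `srtctl apply --json | jq .slurm_job_id` see clean input.
print("submitting job ...", file=sys.stderr)
print(submission_line({"slurm_job_id": "12345", "job_name": "demo"}))
```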

- srtctl/mock.py: MockInfra context manager swapping 16 external surfaces
  (start_srun_process in 8 modules, hostname/IP resolution in 4, port and
  model health in 3, status HTTP in 1) with local fakes. FakePopen drop-in
  mimics subprocess.Popen. run_mock_sweep executes the full
  SweepOrchestrator against those fakes and writes realistic artifacts
  (status.json, status_events.jsonl, result.json, recipe.lock.yaml, logs).
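A drop-in Popen fake of the kind described above might look like this; the class is a hypothetical sketch covering the surface orchestration code typically touches (`poll`, `wait`, `terminate`, `communicate`), not srtctl's actual `FakePopen`:

```python
import io

class FakePopen:
    """Minimal stand-in mimicking the subprocess.Popen surface."""

    def __init__(self, args, **kwargs):
        self.args = args
        self.returncode = None          # None while "running"
        self.stdout = io.BytesIO(b"")
        self.stderr = io.BytesIO(b"")

    def poll(self):
        return self.returncode

    def wait(self, timeout=None):
        self.returncode = 0             # pretend the process exited cleanly
        return self.returncode

    def terminate(self):
        self.returncode = -15           # mirror SIGTERM semantics

    def communicate(self, input=None, timeout=None):
        self.wait()
        return self.stdout.read(), self.stderr.read()
```

Because the fake keeps Popen's attribute and method shapes, code that only inspects `returncode` or calls `poll`/`wait` runs unchanged against it.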

- srtctl.cli.mock_worker: CLI entry (`python -m srtctl.cli.mock_worker ...`)
  that wraps run_mock_sweep for spawning as a subprocess.

- srtctl apply --mock [--mock-tick-s T]: stubs sbatch inside the real submit
  flow (submit_with_orchestrator still runs for real — config load, metadata
  write, JSON submission emission), then detaches a mock_worker subprocess
  that drives the full SweepOrchestrator against the output_dir.
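The detach step can be sketched as a session-separated subprocess launch; the helper name and flags below are illustrative, not srtctl's actual code:

```python
import subprocess
import sys

def detach_worker(cmd: list[str]) -> int:
    """Launch a worker command detached from the caller's session so the
    submitting process can exit while the worker keeps driving the sweep."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # new session: not killed with the parent
    )
    return proc.pid

# Hypothetical invocation mirroring the flow described above:
# detach_worker([sys.executable, "-m", "srtctl.cli.mock_worker",
#                "--output-dir", out_dir, "--mock-tick-s", "0.5"])
```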

- Tests: test_apply_json (3), test_apply_mock (2), test_mock_sweep (3).

- CI: new `mock-and-server` job explicitly runs the new test files plus
  test_integration_status, and smoke-tests both `srtctl apply --mock --json`
  and `python -m srtctl.cli.mock_worker` as real subprocesses.
ishandhanani force-pushed the codex/telemetry-custom-benchmark branch from af623f5 to c9856d7 on April 18, 2026 at 02:05
On GitHub Actions runners RUNNER_NAME is auto-set, which causes
get_job_name() to return the runner identity instead of the configured
job name, tripping the job_name assertion.
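A minimal illustration of the precedence problem (the function signature is hypothetical; the real `get_job_name` lives elsewhere in the codebase):

```python
import os

def get_job_name(configured=None):
    """If the environment-derived name wins unconditionally, any machine
    that pre-sets RUNNER_NAME (as GitHub Actions runners always do)
    shadows the configured job name and trips the job_name assertion.
    Letting the configured name take precedence avoids that.
    """
    return configured or os.environ.get("RUNNER_NAME", "unknown")
```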
ishandhanani merged commit f02b633 into main on Apr 18, 2026
6 checks passed