Skip to content

Push logs_url to status API eagerly and via final PUT#54

Merged
ishandhanani merged 3 commits intomainfrom
ishandhanani/status-reporter-eager-artifacts-push
Apr 21, 2026
Merged

Push logs_url to status API eagerly and via final PUT#54
ishandhanani merged 3 commits intomainfrom
ishandhanani/status-reporter-eager-artifacts-push

Conversation

@ishandhanani
Copy link
Copy Markdown
Collaborator

@ishandhanani ishandhanani commented Apr 21, 2026

Summary

  • Fix: recipe-defined reporting.status.endpoints were silently skipped by the post-process dashboard PUT. The path only read load_cluster_config(), so any endpoint set in the recipe (rather than cluster srtslurm.yaml) never received logs_url. Consequence: the collector recorded job lifecycle correctly but logs_url stayed null, and consumers could not find the S3 artifacts.
  • Fix: report_completed fired before run_postprocess in do_sweep.run()'s finally block, so the completed PUT reached the collector before any rollup was generated.
  • Feature: new StatusReporter.report_artifacts(logs_url) for eager mid-run pushes — fires as soon as S3 sync completes, before AI analysis runs, so a hanging or crashing analyzer cannot strand the artifact pointer.

Changes

File What
src/srtctl/core/status.py report_completed(exit_code, logs_url=None) kwarg; new report_artifacts(logs_url) that PUTs at status=benchmark (does not flip lifecycle)
src/srtctl/cli/mixins/postprocess_stage.py run_postprocess(exit_code, reporter=None); eager reporter.report_artifacts after S3 sync; stash self._last_logs_url; removed the broken _report_metrics
src/srtctl/cli/do_sweep.py finally reordered so run_postprocess runs before report_completed; passes reporter in and pulls logs_url back for the final PUT
tests/test_status.py Coverage for logs_url kwarg and report_artifacts (disabled, empty URL, success without flipping to completed)
tests/test_postprocess.py Replaced _report_metrics assertions with the eager-push + stash shape; added skip-paths for no-S3-URL and no-reporter cases
src/srtctl/core/config.py, src/srtctl/core/__init__.py Small helper refactor pending in the tree: extract find_cluster_config_path, add get_cluster_aliases

Why two PUTs

Defense in depth: the eager report_artifacts PUT fires the moment S3 sync returns, independent of whether the AI analyzer runs, times out, or crashes. The final report_completed in do_sweep then reasserts the same logs_url idempotently at status=completed. If either PUT succeeds, the collector has the pointer.

Compat

  • report_completed(exit_code) still works — logs_url is optional. No call-site changes required for existing callers.
  • No changes to the Status API v1 wire contract. logs_url has always been a first-class field in JobUpdatePayload.

Test plan

  • uv run pytest tests/ -q — 589 passed, 2 skipped, 6 deselected
  • uv run ruff check on touched files — all pass
  • uv run ruff format --check on touched files — all pass
  • Live verification against the Brev collector once a benchmark recipe with reporting.status.endpoint lands on a cluster with reporting.s3 configured — expected: GET /api/jobs/{id} returns logs_url pointing at the S3 prefix.

🤖 Generated with Claude Code

Previously, when a recipe set `reporting.status.endpoint` but the cluster's
srtslurm.yaml did not, `_report_metrics` in PostProcessStageMixin would
silently no-op because it only read `load_cluster_config()`. The final
`report_completed` PUT also fired before `run_postprocess`, so the
"completed" status hit the collector before any rollup was generated.
Net effect: the collector recorded lifecycle correctly but `logs_url`
stayed null, and consumers (e.g. IBAR) could not find the S3 artifacts.

Changes:
- `StatusReporter.report_completed(exit_code, logs_url=None)` — final PUT
  now carries the artifact pointer when available.
- New `StatusReporter.report_artifacts(logs_url)` — eager mid-run PUT that
  updates `logs_url` without flipping the lifecycle to `completed`.
  Fires as soon as S3 sync returns, before AI analysis runs, so a hanging
  or crashing analyzer cannot strand the pointer.
- `run_postprocess(exit_code, reporter=None)` — accepts the reporter,
  pushes `logs_url` eagerly, and stashes it on `self._last_logs_url`
  for the caller's final `report_completed`.
- Removed `_report_metrics`: duplicate PUT path that only read cluster
  config, missed recipe-defined endpoints, and is now fully covered by
  the reporter-based path that resolves both sources via `from_config`.
- `do_sweep.run()` finally block reordered: postprocess runs before
  `report_completed`, and the final PUT reasserts `logs_url` idempotently.

Benchmark results themselves are intentionally NOT forwarded through the
status API. S3 remains the source of truth for artifacts; the collector
only stores the pointer.

Also bundles a small helper refactor in `core/config.py` (extract
`find_cluster_config_path`, add `get_cluster_aliases`) that was pending
in the working tree.

Tests:
- `test_status.py`: coverage for the new `logs_url` kwarg on
  `report_completed` and three cases for `report_artifacts` (disabled,
  empty URL, success without flipping status to completed).
- `test_postprocess.py`: replaced `_report_metrics` assertions with the
  new shape — eager reporter push, stash on self, skip paths for no S3
  URL and no reporter.

All 589 tests pass. Ruff clean on touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ishandhanani and others added 2 commits April 21, 2026 01:06
Four references in comments/docstrings/CLI help were naming a specific
external harness. Replaced with provider-neutral language ("upstream
harnesses", "downstream consumers"). No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ishandhanani ishandhanani merged commit 713b872 into main Apr 21, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant