iris: add --dry-run mode to controller startup #4145
Conversation
Gate all side effects (assignments, kills, checkpoints, provider sync, autoscaler updates, pruning) behind ControllerConfig.dry_run. The controller still starts up, loads config, restores from checkpoint, and runs scheduling computations — but never mutates external state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
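As a rough illustration of the gating pattern this commit describes (a sketch, not the actual iris code: the dry_run field and the [DRY-RUN] log prefix are from this PR; ControllerConfig's other fields and the method body are assumed):

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ControllerConfig:
    dry_run: bool = False  # other config fields elided


class Controller:
    def __init__(self, config: ControllerConfig):
        self._config = config

    def _commit_assignments(self, assignments: list) -> None:
        # Scheduling is computed as usual; only the mutation is gated.
        if self._config.dry_run:
            logger.info("[DRY-RUN] Skipping assignment commit")
            return
        ...  # real DB write elided
```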
Tests verify that dry-run mode suppresses all side effects:

- Controller starts/stops without errors
- Scheduling runs but does not dispatch tasks
- Provider sync is skipped
- Autoscaler refreshes but does not update VMs
- Checkpoint returns sentinel result
- Pruning is skipped
- Kill tasks are suppressed

Also fixes CheckpointResult missing created_at in the dry-run sentinel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 125437549e
@claude what about loading the controller with an InMemoryGCP service and a fake worker stub creator instead of hacking "if dry_run" everywhere? would that work
Claude finished @rjpower's task in 1m 19s — Analysis: DI-based dry-run vs. explicit if dry_run guards

What fake providers would cover:

| Side effect | Current guard location | Fakeable? |
|---|---|---|
| Worker RPCs (heartbeat/dispatch) | _sync_all_execution_units | ✅ — FakeProvider.sync() already no-ops |
| Autoscaler VM changes | _run_autoscaler_once | ✅ — a no-op Autoscaler (or None) works |
| Task kill RPCs | kill_tasks_on_workers | ✅ — routes through provider |
What fake providers wouldn't cover:
| Side effect | Why DI doesn't help |
|---|---|
| Checkpoint writes (begin_checkpoint, _atexit_checkpoint, periodic) | Writes to GCS via write_checkpoint(db, remote_state_dir) — not routed through any injected service. You'd need a fake fsspec backend or a separate CheckpointWriter abstraction (sketched below). |
| DB mutations via ControllerTransitions (_commit_assignments, _mark_task_unschedulable, replace_reservation_claims, prune_old_data) | These write directly to the local SQLite DB. In dry-run you want to compute scheduling but not commit assignments. Transitions isn't an injected dependency — it's constructed internally from the DB. |
| Profile upload thread | Skipped in dry-run via start() — not provider-related. |
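For concreteness, the CheckpointWriter abstraction mentioned in the first row might look like this (a hypothetical sketch; only write_checkpoint(db, remote_state_dir) is the existing function, assumed imported in real code):

```python
import logging
from typing import Optional, Protocol

logger = logging.getLogger(__name__)


class CheckpointWriter(Protocol):
    def write(self, db, remote_state_dir: str) -> Optional[object]: ...


class GCSCheckpointWriter:
    def write(self, db, remote_state_dir: str):
        # Production path: delegate to the existing iris function.
        return write_checkpoint(db, remote_state_dir)


class NoOpCheckpointWriter:
    def write(self, db, remote_state_dir: str):
        # Dry-run path: log and skip the GCS write entirely.
        logger.info("[DRY-RUN] Skipping checkpoint write to %s", remote_state_dir)
        return None
```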
So you'd still need guards for: checkpoint writes, assignment commits, pruning, unschedulable marking, and reservation claim writes. That's 6-7 of the ~12 current if dry_run checks that wouldn't go away.
A middle path that might work better:
Instead of either extreme, you could introduce a SideEffectGate (or just a ReadOnlyTransitions wrapper) that the controller consults for all mutating operations:
```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class SideEffectGate:
    """Controls whether the controller may perform side effects."""

    enabled: bool = True

    def guard(self, label: str) -> bool:
        if not self.enabled:
            logger.info("[DRY-RUN] Suppressed: %s", label)
            return False
        return True
```

Then the checks become if self._gate.guard("checkpoint"): write_checkpoint(...) — still explicit, but the dry-run logic is centralized in one object rather than self._config.dry_run scattered across 12 call sites. It also makes it easy to add metrics (a count of suppressed operations) or to make the gate togglable at runtime via the dashboard.
But honestly — for a feature this focused (operator inspects restored state, then Ctrl-C), the current if dry_run approach is fine. It's ~12 checks in a 1600-line file, they're all one-liners, and they're easy to grep for. The DI approach would require new abstractions (NoOpAutoscaler, NoOpCheckpointWriter, ReadOnlyTransitions) that don't earn their keep for a flag that's only used in manual debugging sessions.
The Codex review point is valid though — refresh() calls scale_down() and scale_down_if_idle(), which terminate VMs. The dry-run guard at line 1595 needs to move before self._autoscaler.refresh(worker_status_map), or you should pass autoscaler=None in dry-run mode (which is arguably the one place where DI would be the cleanest fix).
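A minimal sketch of the first option, assuming the method shape (the real _run_autoscaler_once body and the update() signature are not shown in this thread):

```python
def _run_autoscaler_once(self, worker_status_map):
    # Guard the whole cycle: refresh() can scale down and terminate VMs,
    # so in dry-run neither refresh nor update may run.
    if self._config.dry_run:
        logger.info("[DRY-RUN] Skipping autoscaler refresh/update")
        return
    self._autoscaler.refresh(worker_status_map)
    self._autoscaler.update()
```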
@claude merge with main, fix the issue from codex review. identify any other missing mutations to external state that we might have left without dry_run guards
Claude finished @rjpower's task in 4m 58s — Merging with main, fixing the Codex review issue, and auditing dry-run guards
What changed

- Fixed the Codex review issue: moved the dry-run guard in _run_autoscaler_once so the entire autoscaler cycle (both refresh and update) is skipped in dry-run mode
- Updated the test to assert that neither refresh nor update is called

Audit: all external side effects are properly guarded
No missing guards found — the autoscaler was the only gap.
…e effects

Autoscaler.refresh() can call scale_down() and terminate VMs, which violates the dry-run safety contract. Move the guard to skip the entire autoscaler cycle (both refresh and update) in dry-run mode. Updates the test to assert neither refresh nor update is called.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add `iris cluster controller serve` subcommand that delegates to `run_controller_serve()`, reusing the config already loaded by the iris CLI group
- Extract serve body into `run_controller_serve()` so both the CLI subcommand and standalone entrypoint (Dockerfile/bootstrap) share it
- In dry-run mode: use a temp dir for local state (avoids /var/cache permission errors on dev machines), skip autoscaler/provider bundle creation (avoids requiring GCP credentials locally)
- Fix signal handler to reset on first Ctrl+C so a second signal force-exits instead of re-entering a hanging shutdown (sketched below)
- Fix misleading "writing final checkpoint" log in dry-run mode
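The Ctrl+C fix from the fourth bullet, sketched under assumptions (the handler shape and begin_shutdown callback are illustrative, not the actual iris code):

```python
import signal


def install_sigint_handler(begin_shutdown) -> None:
    """Install a SIGINT handler that resets itself after the first Ctrl+C."""

    def handler(signum, frame):
        # Restore the default handler first: a second Ctrl+C then force-exits
        # instead of re-entering a shutdown that may be hanging.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        begin_shutdown()

    signal.signal(signal.SIGINT, handler)
```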
yonromai left a comment
Approved.
The dry-run control-flow itself looks good, and the autoscaler-refresh issue from the earlier Codex review is fixed. I found two follow-ups worth addressing before relying on this for operator workflows:
- expected startup/config errors now escape as tracebacks instead of clean Click errors
- BeginCheckpoint reports synthetic zero counts in dry-run instead of the restored DB state
Targeted validation passed locally:
`uv run --package iris --group dev pytest lib/iris/tests/cluster/controller/test_dry_run.py -q` (7 passed)
Generated with Codex
```python
configure_logging(level=getattr(logging, log_level))

cluster_config = load_config(Path(config_file))
```
This regresses the operator UX for expected startup failures. Before the refactor, serve() converted config/load/startup errors into click.ClickException, so a missing checkpoint or bad config produced a clean CLI error. Now load_config(...) here and the ValueErrors from run_controller_serve() bubble out of both CLI entrypoints, which means iris ... controller serve will dump a traceback for normal operator mistakes.
Can we wrap the shared startup path in click.ClickException again (both here and in iris.cli.cluster.controller_serve) so expected validation/startup failures stay user-facing?
Generated with Codex
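A sketch of the suggested wrapping, assuming Click and that load_config / run_controller_serve raise ValueError (or OSError) for expected failures; the option names and the run_controller_serve signature are taken loosely from this PR, not verified:

```python
import click
from pathlib import Path


@click.command("serve")
@click.option("--config-file", required=True)
@click.option("--dry-run", is_flag=True, default=False)
def controller_serve(config_file: str, dry_run: bool) -> None:
    try:
        cluster_config = load_config(Path(config_file))  # existing iris helper
        run_controller_serve(cluster_config, dry_run=dry_run)
    except (ValueError, OSError) as e:
        # Expected operator mistakes (bad config, missing checkpoint) become
        # clean CLI errors instead of tracebacks.
        raise click.ClickException(str(e)) from e
```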
we could, but it was already raising the full exception trace, so i think i'm fine with it this way
| """Pause loops and write a consistent SQLite checkpoint copy.""" | ||
| if self._config.dry_run: | ||
| logger.info("[DRY-RUN] Skipping checkpoint write") | ||
| return ("dry-run", CheckpointResult(created_at=Timestamp.now(), job_count=0, task_count=0, worker_count=0)) |
In dry-run mode this returns a synthetic CheckpointResult with zero jobs/tasks/workers regardless of what was restored into the DB. That makes BeginCheckpoint lie about the controller state: iris cluster controller checkpoint will happily report 0 jobs, 0 tasks, 0 workers against a populated dry-run controller.
Can we either reject BeginCheckpoint in dry-run with a clear error, or compute the counts from the current DB snapshot while still skipping the actual checkpoint write?
Generated with Codex
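For reference, the second option would look roughly like this (the count_* helpers are hypothetical stand-ins for whatever queries the real DB layer exposes):

```python
if self._config.dry_run:
    logger.info("[DRY-RUN] Skipping checkpoint write")
    return (
        "dry-run",
        CheckpointResult(
            created_at=Timestamp.now(),
            job_count=self._db.count_jobs(),        # hypothetical helpers:
            task_count=self._db.count_tasks(),      # use whatever count queries
            worker_count=self._db.count_workers(),  # the real DB layer exposes
        ),
    )
```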
this is an explicit call a user would make in practice, so i think this is fine as is. the intention of dry-run mode is to ensure things like DB migrations worked properly, not exercise actual checkpoint logic
- Add `--dry-run` flag to `iris controller serve` that starts the controller normally (loads config, restores from checkpoint, runs scheduling loop) but suppresses all side effects
- Gates task dispatch, provider sync (worker RPCs), autoscaler VM changes, checkpoint writes, pruning, and task kills
- Controller still serves dashboard + RPC so operators can inspect restored state
- Useful for testing controller reloads and migrations locally before reloading the entire cluster

Closes #4141