
iris: add --dry-run mode to controller startup #4145

Merged
rjpower merged 6 commits into main from multi/jXeCrjuj on Mar 26, 2026

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Mar 25, 2026

  • Add --dry-run flag to iris controller serve that starts the controller normally (loads config, restores from checkpoint, runs scheduling loop) but suppresses all side effects
  • Gates task dispatch, provider sync (worker RPCs), autoscaler VM changes, checkpoint writes, pruning, and task kills
  • Controller still serves dashboard + RPC so operators can inspect restored state
  • Useful for testing controller reloads and migrations locally before reloading the entire cluster

Closes #4141
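
A rough sketch of the gating pattern described above (method names such as _dispatch_ready_tasks and _compute_assignments are illustrative, not the actual controller internals; only the dry_run config flag and the [DRY-RUN] log prefix come from this PR):

def _dispatch_ready_tasks(self) -> None:
    # Scheduling computations still run in dry-run mode...
    assignments = self._compute_assignments()
    if self._config.dry_run:
        # ...but nothing is sent to workers; we only log what would have happened.
        logger.info("[DRY-RUN] Would dispatch %d assignments", len(assignments))
        return
    self._queue_assignments(assignments)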

rjpower and others added 2 commits March 25, 2026 11:15
Gate all side effects (assignments, kills, checkpoints, provider sync,
autoscaler updates, pruning) behind ControllerConfig.dry_run. The
controller still starts up, loads config, restores from checkpoint, and
runs scheduling computations — but never mutates external state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify that dry-run mode suppresses all side effects:
- Controller starts/stops without errors
- Scheduling runs but does not dispatch tasks
- Provider sync is skipped
- Autoscaler refreshes but does not update VMs
- Checkpoint returns sentinel result
- Pruning is skipped
- Kill tasks are suppressed

Also fixes CheckpointResult missing created_at in dry-run sentinel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rjpower rjpower added the agent-generated (Created by automation/agent) label Mar 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 125437549e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: lib/iris/src/iris/cluster/controller/controller.py (Outdated)
@rjpower
Collaborator Author

rjpower commented Mar 25, 2026

@claude what about loading the controller with an InMemoryGCP service and a fake worker stub creator instead of hacking "if dry_run" everywhere? would that work

@claude
Contributor

claude Bot commented Mar 25, 2026

Claude finished @rjpower's task in 1m 19s


Analysis: DI-based dry-run vs if dry_run guards

  • Read current PR implementation
  • Examine controller dependencies (GCP service, worker stubs, autoscaler)
  • Assess feasibility of DI-based approach

Good instinct — the DI approach is cleaner in principle, but the side-effect surface here is wider than what fake providers alone would cover. Here's the breakdown:

What fake providers would cover (≈ half the guards):

| Side effect | Current guard location | Fakeable? |
| --- | --- | --- |
| Worker RPCs (heartbeat/dispatch) | _sync_all_execution_units | ✅ — FakeProvider.sync() already no-ops |
| Autoscaler VM changes | _run_autoscaler_once | ✅ — a no-op Autoscaler (or None) works |
| Task kill RPCs | kill_tasks_on_workers | ✅ — routes through provider |

What fake providers wouldn't cover:

| Side effect | Why DI doesn't help |
| --- | --- |
| Checkpoint writes (begin_checkpoint, _atexit_checkpoint, periodic) | Writes to GCS via write_checkpoint(db, remote_state_dir) — not routed through any injected service. You'd need a fake fsspec backend or a separate CheckpointWriter abstraction. |
| DB mutations via ControllerTransitions (_commit_assignments, _mark_task_unschedulable, replace_reservation_claims, prune_old_data) | These write directly to the local SQLite DB. In dry-run you want to compute scheduling but not commit assignments. Transitions isn't an injected dependency — it's constructed internally from the DB. |
| Profile upload thread | Skipped in dry-run via start() — not provider-related. |

So you'd still need guards for: checkpoint writes, assignment commits, pruning, unschedulable marking, and reservation claim writes. That's 6-7 of the ~12 current if dry_run checks that wouldn't go away.

A middle path that might work better:

Instead of either extreme, you could introduce a SideEffectGate (or just a ReadOnlyTransitions wrapper) that the controller consults for all mutating operations:

from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class SideEffectGate:
    """Controls whether the controller may perform side effects."""
    enabled: bool = True

    def guard(self, label: str) -> bool:
        if not self.enabled:
            logger.info("[DRY-RUN] Suppressed: %s", label)
            return False
        return True

Then the checks become if self._gate.guard("checkpoint"): write_checkpoint(...) — still explicit, but the dry-run logic is centralized in one object rather than self._config.dry_run scattered across 12 call sites. It also makes it easy to add metrics (count of suppressed operations) or to make it togglable at runtime via the dashboard.
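
For instance, a call site in the controller could look roughly like this (the fields self._db and self._remote_state_dir are simplified stand-ins; write_checkpoint(db, remote_state_dir) is the signature mentioned above):

class Controller:
    def __init__(self, config):
        # One gate shared by every side-effecting path; dry-run disables it.
        self._gate = SideEffectGate(enabled=not config.dry_run)

    def _maybe_checkpoint(self) -> None:
        # Bookkeeping still runs; only the external checkpoint write is gated.
        if self._gate.guard("checkpoint"):
            write_checkpoint(self._db, self._remote_state_dir)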

But honestly — for a feature this focused (operator inspects restored state, then Ctrl-C), the current if dry_run approach is fine. It's ~12 checks in a 1600-line file, they're all one-liners, and they're easy to grep for. The DI approach would require new abstractions (NoOpAutoscaler, NoOpCheckpointWriter, ReadOnlyTransitions) that don't earn their keep for a flag that's only used in manual debugging sessions.

The Codex review point is valid, though: refresh() calls scale_down() and scale_down_if_idle(), which terminate VMs. The dry-run guard at line 1595 needs to move before self._autoscaler.refresh(worker_status_map), or you should pass autoscaler=None in dry-run mode (which is arguably the one place where DI would be the cleanest fix).
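
A sketch of the reordering being suggested (the method body is simplified from the descriptions in this thread, not the real controller code):

def _run_autoscaler_once(self, worker_status_map) -> None:
    # Guard first: refresh() itself may call scale_down(), so in dry-run
    # the entire autoscaler cycle is skipped, not just the update step.
    if self._config.dry_run:
        logger.info("[DRY-RUN] Skipping autoscaler cycle")
        return
    self._autoscaler.refresh(worker_status_map)
    self._autoscaler.update()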


@rjpower
Collaborator Author

rjpower commented Mar 25, 2026

@claude merge with main, fix the issue from codex review. identify any other missing mutations to external state that we might have left without dry_run guards

@claude
Contributor

claude Bot commented Mar 25, 2026

Claude finished @rjpower's task in 4m 58s


Merging with main, fixing Codex review issue, and auditing dry-run guards

  • Merge branch with main
  • Read current controller code and PR changes
  • Fix Codex review: move dry-run guard before autoscaler refresh
  • Audit for missing dry-run guards on external state mutations
  • Run pre-commit and tests
  • Push changes

What changed

Fixed the Codex review issue: moved the dry-run guard in _run_autoscaler_once() to before self._autoscaler.refresh(worker_status_map). Previously, refresh() ran before the guard and could call scale_down() / scale_down_if_idle() which terminate VMs — breaking the dry-run safety contract. Now the entire autoscaler cycle (both refresh and update) is skipped in dry-run mode.

Renamed the test from test_dry_run_autoscaler_refresh_but_no_update to test_dry_run_autoscaler_skipped_entirely and updated it to assert that neither refresh nor update is called.
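
A minimal sketch of what that test can look like (make_controller is an illustrative fixture and the _run_autoscaler_once signature is assumed; the real test lives in tests/cluster/controller/test_dry_run.py):

from unittest import mock

def test_dry_run_autoscaler_skipped_entirely():
    # Illustrative setup: a controller started with dry_run=True,
    # with its autoscaler swapped for a mock.
    controller = make_controller(dry_run=True)
    controller._autoscaler = mock.Mock()

    controller._run_autoscaler_once(worker_status_map={})

    controller._autoscaler.refresh.assert_not_called()
    controller._autoscaler.update.assert_not_called()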

Audit: all external side effects are properly guarded

| Side effect | Guard location | Status |
| --- | --- | --- |
| Task dispatch (queue_assignments) | _buffer_assignments L1380 | guarded |
| Mark unschedulable (mark_task_unschedulable) | _mark_task_unschedulable L1391 | guarded |
| Kill tasks on workers | kill_tasks_on_workers L1423 | guarded |
| Reap stale workers (fail_workers_by_ids) | _reap_stale_workers L1459 | guarded |
| Provider sync (worker RPCs) | _sync_all_execution_units L1483 | guarded |
| Direct provider sync (K8s) | _sync_direct_provider L1017 | guarded |
| Autoscaler (VM changes) | _run_autoscaler_once L1590 | fixed |
| Checkpoint writes | begin_checkpoint L1622 | guarded |
| Atexit checkpoint | _atexit_checkpoint L921 | guarded |
| Periodic checkpoint | _run_autoscaler_loop L977 | guarded |
| Data pruning | _maybe_prune L948 | guarded |
| Reservation claim writes | _run_scheduling L1241 | guarded |
| Profile RPCs | Thread not started (L845) | guarded |

No missing guards found — the autoscaler was the only gap.

Testing

  • ./infra/pre-commit.py --all-files --fix — ✅ passed
  • uv run pytest tests/cluster/controller/test_dry_run.py -v — ✅ 7/7 passed
    · Branch multi/jXeCrjuj

github-actions Bot and others added 2 commits March 25, 2026 23:56
…e effects

Autoscaler.refresh() can call scale_down() and terminate VMs, which
violates the dry-run safety contract. Move the guard to skip the entire
autoscaler cycle (both refresh and update) in dry-run mode.

Updates the test to assert neither refresh nor update are called.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower requested a review from yonromai March 26, 2026 00:05
rjpower added 2 commits March 25, 2026 17:22
- Add `iris cluster controller serve` subcommand that delegates to
  `run_controller_serve()`, reusing the config already loaded by the
  iris CLI group
- Extract serve body into `run_controller_serve()` so both the CLI
  subcommand and standalone entrypoint (Dockerfile/bootstrap) share it
- In dry-run mode: use a temp dir for local state (avoids /var/cache
  permission errors on dev machines), skip autoscaler/provider bundle
  creation (avoids requiring GCP credentials locally)
- Fix signal handler to reset on first Ctrl+C so second signal
  force-exits instead of re-entering a hanging shutdown
- Fix misleading "writing final checkpoint" log in dry-run mode
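
A sketch of the signal-handler pattern the Ctrl+C bullet above describes (the shutdown callback and wiring are illustrative, not the actual entrypoint code):

import signal

def install_sigint_handler(shutdown):
    def handler(signum, frame):
        # First Ctrl+C: restore the default handler so a second Ctrl+C
        # force-exits instead of re-entering a hanging shutdown.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        shutdown()

    signal.signal(signal.SIGINT, handler)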
Contributor

@yonromai yonromai left a comment


Approved.

The dry-run control-flow itself looks good, and the autoscaler-refresh issue from the earlier Codex review is fixed. I found two follow-ups worth addressing before relying on this for operator workflows:

  • expected startup/config errors now escape as tracebacks instead of clean Click errors
  • BeginCheckpoint reports synthetic zero counts in dry-run instead of the restored DB state

Targeted validation passed locally:
uv run --package iris --group dev pytest lib/iris/tests/cluster/controller/test_dry_run.py -q (7 passed)

Generated with Codex


configure_logging(level=getattr(logging, log_level))

cluster_config = load_config(Path(config_file))
Contributor


This regresses the operator UX for expected startup failures. Before the refactor, serve() converted config/load/startup errors into click.ClickException, so a missing checkpoint or bad config produced a clean CLI error. Now load_config(...) here and the ValueErrors from run_controller_serve() bubble out of both CLI entrypoints, which means iris ... controller serve will dump a traceback for normal operator mistakes.

Can we wrap the shared startup path in click.ClickException again (both here and in iris.cli.cluster.controller_serve) so expected validation/startup failures stay user-facing?

Generated with Codex
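
A sketch of the wrapping being requested (the option list and the run_controller_serve arguments are simplified; only click.ClickException and the entrypoint names come from the comment above):

import click

@click.command("serve")
@click.option("--dry-run", is_flag=True, default=False)
def controller_serve(dry_run: bool) -> None:
    try:
        # Shared startup path; its real arguments (config, ports, ...) are omitted here.
        run_controller_serve(dry_run=dry_run)
    except (ValueError, FileNotFoundError) as exc:
        # Expected config/startup mistakes surface as a clean CLI error
        # instead of a raw traceback.
        raise click.ClickException(str(exc)) from exc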

Collaborator Author


we could, but it was already raising the full exception trace, so i think i'm fine with it this way

"""Pause loops and write a consistent SQLite checkpoint copy."""
if self._config.dry_run:
logger.info("[DRY-RUN] Skipping checkpoint write")
return ("dry-run", CheckpointResult(created_at=Timestamp.now(), job_count=0, task_count=0, worker_count=0))
Contributor


In dry-run mode this returns a synthetic CheckpointResult with zero jobs/tasks/workers regardless of what was restored into the DB. That makes BeginCheckpoint lie about the controller state: iris cluster controller checkpoint will happily report 0 jobs, 0 tasks, 0 workers against a populated dry-run controller.

Can we either reject BeginCheckpoint in dry-run with a clear error, or compute the counts from the current DB snapshot while still skipping the actual checkpoint write?

Generated with Codex
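
For reference, a sketch of the second option (counting from the restored DB while still skipping the write; the count_* helpers are hypothetical placeholders for whatever query methods the controller DB exposes):

if self._config.dry_run:
    logger.info("[DRY-RUN] Skipping checkpoint write")
    # Report what was actually restored instead of synthetic zeros.
    return (
        "dry-run",
        CheckpointResult(
            created_at=Timestamp.now(),
            job_count=self._db.count_jobs(),
            task_count=self._db.count_tasks(),
            worker_count=self._db.count_workers(),
        ),
    )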

Collaborator Author


this is an explicit call a user would make in practice, so i think this is fine as is. the intention of dry-run mode is to ensure things like DB migrations worked properly, not exercise actual checkpoint logic

@rjpower rjpower merged commit 405009c into main Mar 26, 2026
62 checks passed
@rjpower rjpower deleted the multi/jXeCrjuj branch March 26, 2026 01:10
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
- Add `--dry-run` flag to `iris controller serve` that starts the
controller normally (loads config, restores from checkpoint, runs
scheduling loop) but suppresses all side effects
- Gates task dispatch, provider sync (worker RPCs), autoscaler VM
changes, checkpoint writes, pruning, and task kills
- Controller still serves dashboard + RPC so operators can inspect
restored state
- Useful for testing controller reloads and migrations locally before
reloading the entire cluster

Closes #4141

Labels

agent-generated Created by automation/agent
