
iris: add --dry-run mode to controller startup #4145

Merged
rjpower merged 6 commits into main from multi/jXeCrjuj on Mar 26, 2026

Conversation

@rjpower
Collaborator

@rjpower rjpower commented Mar 25, 2026

  • Add --dry-run flag to iris controller serve that starts the controller normally (loads config, restores from checkpoint, runs scheduling loop) but suppresses all side effects
  • Gates task dispatch, provider sync (worker RPCs), autoscaler VM changes, checkpoint writes, pruning, and task kills
  • Controller still serves dashboard + RPC so operators can inspect restored state
  • Useful for testing controller reloads and migrations locally before reloading the entire cluster

Closes #4141
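
A rough sketch of the gating pattern described above (method names such as _dispatch_ready_tasks and _compute_assignments are illustrative, not the actual controller internals; only the dry_run config flag and the [DRY-RUN] log prefix come from this PR):

def _dispatch_ready_tasks(self) -> None:
    # Scheduling computations still run in dry-run mode...
    assignments = self._compute_assignments()
    if self._config.dry_run:
        # ...but nothing is sent to workers; we only log what would have happened.
        logger.info("[DRY-RUN] Would dispatch %d assignments", len(assignments))
        return
    self._queue_assignments(assignments)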

rjpower and others added 2 commits March 25, 2026 11:15
Gate all side effects (assignments, kills, checkpoints, provider sync,
autoscaler updates, pruning) behind ControllerConfig.dry_run. The
controller still starts up, loads config, restores from checkpoint, and
runs scheduling computations — but never mutates external state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify that dry-run mode suppresses all side effects:
- Controller starts/stops without errors
- Scheduling runs but does not dispatch tasks
- Provider sync is skipped
- Autoscaler refreshes but does not update VMs
- Checkpoint returns sentinel result
- Pruning is skipped
- Kill tasks are suppressed

Also fixes CheckpointResult missing created_at in dry-run sentinel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rjpower rjpower added the agent-generated (Created by automation/agent) label Mar 25, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 125437549e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: lib/iris/src/iris/cluster/controller/controller.py (Outdated)
@rjpower
Collaborator Author

rjpower commented Mar 25, 2026

@claude what about loading the controller with an InMemoryGCP service and a fake worker stub creator instead of hacking "if dry_run" everywhere? would that work

@claude
Contributor

claude Bot commented Mar 25, 2026

Claude finished @rjpower's task in 1m 19s


Analysis: DI-based dry-run vs if dry_run guards

  • Read current PR implementation
  • Examine controller dependencies (GCP service, worker stubs, autoscaler)
  • Assess feasibility of DI-based approach

Good instinct — the DI approach is cleaner in principle, but the side-effect surface here is wider than what fake providers alone would cover. Here's the breakdown:

What fake providers would cover (≈ half the guards):

| Side effect | Current guard location | Fakeable? |
| --- | --- | --- |
| Worker RPCs (heartbeat/dispatch) | _sync_all_execution_units | ✅ — FakeProvider.sync() already no-ops |
| Autoscaler VM changes | _run_autoscaler_once | ✅ — a no-op Autoscaler (or None) works |
| Task kill RPCs | kill_tasks_on_workers | ✅ — routes through provider |

What fake providers wouldn't cover:

| Side effect | Why DI doesn't help |
| --- | --- |
| Checkpoint writes (begin_checkpoint, _atexit_checkpoint, periodic) | Writes to GCS via write_checkpoint(db, remote_state_dir) — not routed through any injected service. You'd need a fake fsspec backend or a separate CheckpointWriter abstraction. |
| DB mutations via ControllerTransitions (_commit_assignments, _mark_task_unschedulable, replace_reservation_claims, prune_old_data) | These write directly to the local SQLite DB. In dry-run you want to compute scheduling but not commit assignments. Transitions isn't an injected dependency — it's constructed internally from the DB. |
| Profile upload thread | Skipped in dry-run via start() — not provider-related. |

So you'd still need guards for: checkpoint writes, assignment commits, pruning, unschedulable marking, and reservation claim writes. That's 6-7 of the ~12 current if dry_run checks that wouldn't go away.

A middle path that might work better:

Instead of either extreme, you could introduce a SideEffectGate (or just a ReadOnlyTransitions wrapper) that the controller consults for all mutating operations:

from dataclasses import dataclass
import logging

logger = logging.getLogger(__name__)


@dataclass
class SideEffectGate:
    """Controls whether the controller may perform side effects."""
    enabled: bool = True

    def guard(self, label: str) -> bool:
        if not self.enabled:
            logger.info("[DRY-RUN] Suppressed: %s", label)
            return False
        return True

Then the checks become if self._gate.guard("checkpoint"): write_checkpoint(...) — still explicit, but the dry-run logic is centralized in one object rather than self._config.dry_run scattered across 12 call sites. It also makes it easy to add metrics (count of suppressed operations) or to make it togglable at runtime via the dashboard.
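
For instance, a call site in the controller could look roughly like this (the fields self._db and self._remote_state_dir are simplified stand-ins; write_checkpoint(db, remote_state_dir) is the signature mentioned above):

class Controller:
    def __init__(self, config):
        # One gate shared by every side-effecting path; dry-run disables it.
        self._gate = SideEffectGate(enabled=not config.dry_run)

    def _maybe_checkpoint(self) -> None:
        # Bookkeeping still runs; only the external checkpoint write is gated.
        if self._gate.guard("checkpoint"):
            write_checkpoint(self._db, self._remote_state_dir)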

But honestly — for a feature this focused (operator inspects restored state, then Ctrl-C), the current if dry_run approach is fine. It's ~12 checks in a 1600-line file, they're all one-liners, and they're easy to grep for. The DI approach would require new abstractions (NoOpAutoscaler, NoOpCheckpointWriter, ReadOnlyTransitions) that don't earn their keep for a flag that's only used in manual debugging sessions.

The Codex review point is valid, though: refresh() calls scale_down() and scale_down_if_idle(), which terminate VMs. The dry-run guard at line 1595 needs to move before self._autoscaler.refresh(worker_status_map), or you should pass autoscaler=None in dry-run mode (which is arguably the one place where DI would be the cleanest fix).
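
A sketch of the reordering being suggested (the method body is simplified from the descriptions in this thread, not the real controller code):

def _run_autoscaler_once(self, worker_status_map) -> None:
    # Guard first: refresh() itself may call scale_down(), so in dry-run
    # the entire autoscaler cycle is skipped, not just the update step.
    if self._config.dry_run:
        logger.info("[DRY-RUN] Skipping autoscaler cycle")
        return
    self._autoscaler.refresh(worker_status_map)
    self._autoscaler.update()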


@rjpower
Collaborator Author

rjpower commented Mar 25, 2026

@claude merge with main, fix the issue from codex review. identify any other missing mutations to external state that we might have left without dry_run guards

@claude
Contributor

claude Bot commented Mar 25, 2026

Claude finished @rjpower's task in 4m 58s


Merging with main, fixing Codex review issue, and auditing dry-run guards

  • Merge branch with main
  • Read current controller code and PR changes
  • Fix Codex review: move dry-run guard before autoscaler refresh
  • Audit for missing dry-run guards on external state mutations
  • Run pre-commit and tests
  • Push changes

What changed

Fixed the Codex review issue: moved the dry-run guard in _run_autoscaler_once() to before self._autoscaler.refresh(worker_status_map). Previously, refresh() ran before the guard and could call scale_down() / scale_down_if_idle() which terminate VMs — breaking the dry-run safety contract. Now the entire autoscaler cycle (both refresh and update) is skipped in dry-run mode.

Renamed the test from test_dry_run_autoscaler_refresh_but_no_update to test_dry_run_autoscaler_skipped_entirely and updated it to assert that neither refresh nor update is called.
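
A minimal sketch of what that test can look like (make_controller is an illustrative fixture and the _run_autoscaler_once signature is assumed; the real test lives in tests/cluster/controller/test_dry_run.py):

from unittest import mock

def test_dry_run_autoscaler_skipped_entirely():
    # Illustrative setup: a controller started with dry_run=True,
    # with its autoscaler swapped for a mock.
    controller = make_controller(dry_run=True)
    controller._autoscaler = mock.Mock()

    controller._run_autoscaler_once(worker_status_map={})

    controller._autoscaler.refresh.assert_not_called()
    controller._autoscaler.update.assert_not_called()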

Audit: all external side effects are properly guarded

| Side effect | Guard location | Status |
| --- | --- | --- |
| Task dispatch (queue_assignments) | _buffer_assignments L1380 | guarded |
| Mark unschedulable (mark_task_unschedulable) | _mark_task_unschedulable L1391 | guarded |
| Kill tasks on workers | kill_tasks_on_workers L1423 | guarded |
| Reap stale workers (fail_workers_by_ids) | _reap_stale_workers L1459 | guarded |
| Provider sync (worker RPCs) | _sync_all_execution_units L1483 | guarded |
| Direct provider sync (K8s) | _sync_direct_provider L1017 | guarded |
| Autoscaler (VM changes) | _run_autoscaler_once L1590 | fixed |
| Checkpoint writes | begin_checkpoint L1622 | guarded |
| Atexit checkpoint | _atexit_checkpoint L921 | guarded |
| Periodic checkpoint | _run_autoscaler_loop L977 | guarded |
| Data pruning | _maybe_prune L948 | guarded |
| Reservation claim writes | _run_scheduling L1241 | guarded |
| Profile RPCs | Thread not started (L845) | guarded |

No missing guards found — the autoscaler was the only gap.

Testing

  • ./infra/pre-commit.py --all-files --fix — ✅ passed
  • uv run pytest tests/cluster/controller/test_dry_run.py -v — ✅ 7/7 passed
    · Branch multi/jXeCrjuj

github-actions Bot and others added 2 commits March 25, 2026 23:56
…e effects

Autoscaler.refresh() can call scale_down() and terminate VMs, which
violates the dry-run safety contract. Move the guard to skip the entire
autoscaler cycle (both refresh and update) in dry-run mode.

Updates the test to assert neither refresh nor update are called.

Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rjpower rjpower requested a review from yonromai March 26, 2026 00:05
rjpower added 2 commits March 25, 2026 17:22
- Add `iris cluster controller serve` subcommand that delegates to
  `run_controller_serve()`, reusing the config already loaded by the
  iris CLI group
- Extract serve body into `run_controller_serve()` so both the CLI
  subcommand and standalone entrypoint (Dockerfile/bootstrap) share it
- In dry-run mode: use a temp dir for local state (avoids /var/cache
  permission errors on dev machines), skip autoscaler/provider bundle
  creation (avoids requiring GCP credentials locally)
- Fix signal handler to reset on first Ctrl+C so second signal
  force-exits instead of re-entering a hanging shutdown
- Fix misleading "writing final checkpoint" log in dry-run mode
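
A sketch of the signal-handler pattern the Ctrl+C bullet above describes (the shutdown callback and wiring are illustrative, not the actual entrypoint code):

import signal

def install_sigint_handler(shutdown):
    def handler(signum, frame):
        # First Ctrl+C: restore the default handler so a second Ctrl+C
        # force-exits instead of re-entering a hanging shutdown.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        shutdown()

    signal.signal(signal.SIGINT, handler)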
Contributor

@yonromai yonromai left a comment


Approved.

The dry-run control-flow itself looks good, and the autoscaler-refresh issue from the earlier Codex review is fixed. I found two follow-ups worth addressing before relying on this for operator workflows:

  • expected startup/config errors now escape as tracebacks instead of clean Click errors
  • BeginCheckpoint reports synthetic zero counts in dry-run instead of the restored DB state

Targeted validation passed locally:
uv run --package iris --group dev pytest lib/iris/tests/cluster/controller/test_dry_run.py -q (7 passed)

Generated with Codex


configure_logging(level=getattr(logging, log_level))

cluster_config = load_config(Path(config_file))
Contributor


This regresses the operator UX for expected startup failures. Before the refactor, serve() converted config/load/startup errors into click.ClickException, so a missing checkpoint or bad config produced a clean CLI error. Now load_config(...) here and the ValueErrors from run_controller_serve() bubble out of both CLI entrypoints, which means iris ... controller serve will dump a traceback for normal operator mistakes.

Can we wrap the shared startup path in click.ClickException again (both here and in iris.cli.cluster.controller_serve) so expected validation/startup failures stay user-facing?

Generated with Codex
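
A sketch of the wrapping being requested (the option list and the run_controller_serve arguments are simplified; only click.ClickException and the entrypoint names come from the comment above):

import click

@click.command("serve")
@click.option("--dry-run", is_flag=True, default=False)
def controller_serve(dry_run: bool) -> None:
    try:
        # Shared startup path; its real arguments (config, ports, ...) are omitted here.
        run_controller_serve(dry_run=dry_run)
    except (ValueError, FileNotFoundError) as exc:
        # Expected config/startup mistakes surface as a clean CLI error
        # instead of a raw traceback.
        raise click.ClickException(str(exc)) from exc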

Collaborator Author


we could, but it was already raising the full exception trace, so i think i'm fine with it this way

"""Pause loops and write a consistent SQLite checkpoint copy."""
if self._config.dry_run:
logger.info("[DRY-RUN] Skipping checkpoint write")
return ("dry-run", CheckpointResult(created_at=Timestamp.now(), job_count=0, task_count=0, worker_count=0))
Contributor


In dry-run mode this returns a synthetic CheckpointResult with zero jobs/tasks/workers regardless of what was restored into the DB. That makes BeginCheckpoint lie about the controller state: iris cluster controller checkpoint will happily report 0 jobs, 0 tasks, 0 workers against a populated dry-run controller.

Can we either reject BeginCheckpoint in dry-run with a clear error, or compute the counts from the current DB snapshot while still skipping the actual checkpoint write?

Generated with Codex
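
For reference, a sketch of the second option (counting from the restored DB while still skipping the write; the count_* helpers are hypothetical placeholders for whatever query methods the controller DB exposes):

if self._config.dry_run:
    logger.info("[DRY-RUN] Skipping checkpoint write")
    # Report what was actually restored instead of synthetic zeros.
    return (
        "dry-run",
        CheckpointResult(
            created_at=Timestamp.now(),
            job_count=self._db.count_jobs(),
            task_count=self._db.count_tasks(),
            worker_count=self._db.count_workers(),
        ),
    )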

Collaborator Author


this is an explicit call a user would make in practice, so i think this is fine as is. the intention of dry-run mode is to ensure things like DB migrations worked properly, not exercise actual checkpoint logic

@rjpower rjpower merged commit 405009c into main Mar 26, 2026
62 checks passed
@rjpower rjpower deleted the multi/jXeCrjuj branch March 26, 2026 01:10
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
- Add `--dry-run` flag to `iris controller serve` that starts the
controller normally (loads config, restores from checkpoint, runs
scheduling loop) but suppresses all side effects
- Gates task dispatch, provider sync (worker RPCs), autoscaler VM
changes, checkpoint writes, pruning, and task kills
- Controller still serves dashboard + RPC so operators can inspect
restored state
- Useful for testing controller reloads and migrations locally before
reloading the entire cluster

Closes #4141

Labels

agent-generated Created by automation/agent
