# Iris Controller Dry-Run Mode: Codebase Analysis

## Controller Startup Flow

### Entry point: `lib/iris/src/iris/cluster/controller/main.py`

The controller daemon is a Click CLI:

```
cli -> serve command (main.py:34-239)
```

`serve` takes `--host`, `--port`, `--scheduler-interval`, `--config`, `--log-level`, `--checkpoint-path`, and `--checkpoint-interval`.

Startup sequence:
1. Load the cluster config from YAML (`load_config()`) — `main.py:81`
2. Resolve `remote_state_dir` and `local_state_dir` — `main.py:88-100`
3. Restore or create the local SQLite DB from a checkpoint — `main.py:106-121`
4. Create the provider via `make_provider()` — `main.py:124`
5. Create the autoscaler (unless the provider is a K8sTaskProvider) — `main.py:128-165`
6. Create the `ControllerConfig` dataclass — `main.py:182-193`
7. Instantiate `Controller(config, provider, autoscaler, db)` — `main.py:196`
8. Call `controller.start()` — `main.py:208`
9. Register a SIGTERM/SIGINT handler that checkpoints and stops — `main.py:218-237`
10. Block on `stop_event.wait()` — `main.py:239`

### Controller.__init__: `controller.py:712-799`

Creates: `ControllerDB`, `LogStore`, `ControllerTransitions`, `Scheduler`, `BundleStore`, `ControllerServiceImpl` (RPC), `ControllerDashboard` (HTTP + dashboard UI).

### Controller.start(): `controller.py:832-871`

Spawns background threads depending on the provider type:

**Worker provider path** (standard):
- `_run_scheduling_loop` thread — task assignment
- `_run_provider_loop` thread — heartbeats/sync with workers
- `_run_profile_loop` thread — periodic CPU profiling
- `_run_autoscaler_loop` thread (if an autoscaler is configured)
- uvicorn server thread (dashboard + RPC)
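
The branching in `start()` can be sketched as follows. The loop names match the source; the loop bodies are no-op stand-ins, and the exact thread setup is an assumption:

```python
import threading

def start_background_threads(direct_provider: bool, has_autoscaler: bool):
    """Sketch of Controller.start(): pick loops by provider type and run
    each as a daemon thread so they die with the process."""
    def loop_stub():
        pass  # stand-in for the real loop bodies

    if direct_provider:
        # K8sTaskProvider path: one combined scheduling + sync loop
        targets = [loop_stub]
    else:
        # scheduling, provider sync, profiling, optionally autoscaling
        targets = [loop_stub, loop_stub, loop_stub]
        if has_autoscaler:
            targets.append(loop_stub)
    targets.append(loop_stub)  # uvicorn server thread, in both paths

    threads = [threading.Thread(target=t, daemon=True) for t in targets]
    for t in threads:
        t.start()
    return threads
```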

## Main Loop: Scheduling (`_run_scheduling_loop`)

Location: `controller.py:921-936`

Calls `_run_scheduling()` (`controller.py:1202-1315`) each cycle. This is the core scheduling function:

1. **Read reservation claims** from the DB — `controller.py:1224`
2. **Clean up stale claims** (dead workers, finished jobs) — `controller.py:1225`
3. **Claim workers for reservations** — `controller.py:1226` → `_claim_workers_for_reservations()` at `controller.py:1151`
4. **Read pending tasks** via `_schedulable_tasks()` — `controller.py:1232`
5. **Read healthy workers** — `controller.py:1233`
6. **Filter tasks**: check scheduling deadlines, reservation gates, per-job caps — `controller.py:1250-1273`
7. **Inject reservation taints** on workers — `controller.py:1281-1282`
8. **Create a SchedulingContext** — `controller.py:1286-1291`
9. **Phase 1**: preference pass (steer reservation tasks to their claimed workers) — `controller.py:1295`
10. **Phase 2**: normal scheduler `find_assignments()` — `controller.py:1298`
11. **Buffer assignments** → `_buffer_assignments()` → `transitions.queue_assignments()` — `controller.py:1303`
12. **Cache diagnostics** for unassigned jobs — `controller.py:1315`
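
Condensed, the filter and two-phase assignment (steps 6, 9, 10) look like this. The task/worker dict shapes and the claims map are illustrative, not the real `SchedulingContext` or `find_assignments()` API:

```python
def run_scheduling_cycle(pending, workers, claims):
    """Illustrative sketch of _run_scheduling() steps 6 and 9-11."""
    # Step 6: drop tasks that missed their scheduling deadline
    schedulable = [t for t in pending if not t.get("deadline_expired")]
    free = {w["id"] for w in workers}
    assignments = []
    # Phase 1 (step 9): preference pass, steer reserved tasks
    # to the workers claimed for their reservation
    for task in schedulable:
        claimed = claims.get(task["id"])
        if claimed in free:
            assignments.append((task["id"], claimed))
            free.discard(claimed)
    # Phase 2 (step 10): normal pass over the remaining tasks and workers
    already = {tid for tid, _ in assignments}
    for task in schedulable:
        if task["id"] in already or not free:
            continue
        assignments.append((task["id"], free.pop()))
    return assignments
```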

### Where task assignment happens

- `_buffer_assignments()` at `controller.py:1359-1367`: calls `transitions.queue_assignments(command)`, which writes the ASSIGNED state to the DB and enqueues dispatch batches
- `transitions.queue_assignments()` at `transitions.py:840`: the actual DB mutation

## Provider Sync / Heartbeat Loop (`_run_provider_loop`)

Location: `controller.py:972-987`

Calls `_sync_all_execution_units()` (`controller.py:1454-1553`):

1. **Reap stale workers** — `controller.py:1460`
2. **Drain dispatch batches** for all healthy workers — `transitions.drain_dispatch_all()` at `controller.py:1464`
3. **Provider.sync(batches)** — sends RPCs to workers (start tasks, heartbeat, collect status) — `controller.py:1470`
4. **Apply results**: `transitions.apply_heartbeats_batch()` for successes, `transitions.fail_heartbeat()` for failures — `controller.py:1475-1527`
5. **Kill tasks** on workers if needed — `controller.py:1529-1530`
6. **Notify the autoscaler** of worker failures — `controller.py:1501-1523`
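
The drain → sync → apply shape of steps 2-4 can be sketched as below. The method names come from the source; the argument and result shapes (dicts with `"ok"`/`"worker_id"` keys) are assumptions:

```python
def provider_sync_cycle(transitions, provider):
    """Sketch of _sync_all_execution_units() steps 2-4: drain dispatch
    batches, sync them via the provider, then apply heartbeat results."""
    batches = transitions.drain_dispatch_all()      # step 2
    results = provider.sync(batches)                # step 3: RPCs to workers
    succeeded = [r for r in results if r["ok"]]
    failed = [r for r in results if not r["ok"]]
    transitions.apply_heartbeats_batch(succeeded)   # step 4: successes
    for r in failed:
        transitions.fail_heartbeat(r["worker_id"])  # step 4: failures
    return len(succeeded), len(failed)
```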

### Direct provider path (`_sync_direct_provider`)

Location: `controller.py:1005-1019`

For K8sTaskProvider:
1. `transitions.drain_for_direct_provider()` — gets tasks to run, running tasks, and tasks to kill
2. `provider.sync(batch)` — applies them to K8s
3. `transitions.apply_direct_provider_updates()` — applies the results

## Autoscaler Loop (`_run_autoscaler_loop`)

Location: `controller.py:952-970`

Calls `_run_autoscaler_once()` (`controller.py:1555-1572`):
1. Build the worker status map — `controller.py:1563`
2. `autoscaler.refresh(worker_status_map)` — probes the cloud API for VM status
3. Compute demand entries — `controller.py:1566-1571`
4. `autoscaler.update(demand_entries)` — scales VMs up/down

Also runs periodic checkpoints if configured — `controller.py:966-970`.
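
The refresh/update split matters for the dry-run discussion later: `refresh()` only reads cloud state, while `update()` mutates it. A sketch of the cycle with that boundary made explicit (the gating flag is illustrative):

```python
def autoscaler_cycle(autoscaler, worker_status_map, demand_entries, dry_run=False):
    """Sketch of _run_autoscaler_once(): refresh is read-only, update mutates."""
    autoscaler.refresh(worker_status_map)  # probe cloud API for VM status
    if dry_run:
        return False                       # demand computed but not applied
    autoscaler.update(demand_entries)      # scale VMs up/down
    return True
```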

## Checkpoint/Archive System

- `write_checkpoint()` at `checkpoint.py:76-130`: SQLite hot backup → upload to `{remote_state_dir}/controller-state/{epoch_ms}/`
- `download_checkpoint_to_local()` at `checkpoint.py:170-218`: finds the latest checkpoint and downloads it to the local DB dir
- `begin_checkpoint()` at `controller.py:1587-1603`: sets `_checkpoint_in_progress=True`, acquires the heartbeat lock, writes the checkpoint, then unsets the flag
- Periodic checkpoints run in the autoscaler loop — `controller.py:966-970`
- Atexit checkpoint — `controller.py:913-919`
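
The hot-backup step maps onto Python's SQLite online-backup API. A hedged, local-only sketch (the real `write_checkpoint()` uploads to remote storage; here the destination just mirrors the `controller-state/{epoch_ms}/` layout on disk):

```python
import sqlite3
import time
from pathlib import Path

def write_checkpoint_sketch(db_path, state_dir):
    """Snapshot a live SQLite DB with the online backup API and store it
    under {state_dir}/controller-state/{epoch_ms}/controller.db.
    The file name and signature are illustrative."""
    dest_dir = Path(state_dir) / "controller-state" / str(int(time.time() * 1000))
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_path = dest_dir / "controller.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest_path)
    src.backup(dst)  # consistent snapshot even while writers are active
    dst.close()
    src.close()
    return dest_path
```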

## Recommended Approach for --dry-run Flag

### What dry-run should do

Show what the controller *would* do without actually doing it. The controller should:
- ✅ Load config, restore the DB, and create all objects normally
- ✅ Run the scheduling loop to compute assignments
- ✅ Probe workers (heartbeats) to discover cluster state
- ❌ NOT dispatch task assignments to workers (suppress `transitions.queue_assignments()`)
- ❌ NOT start/stop VMs via the autoscaler (`autoscaler.update()` should be read-only or skipped)
- ❌ NOT kill tasks on workers
- ❌ NOT write checkpoints (state hasn't changed)
- ❌ NOT modify reservation claims in the DB
- ✅ Log what would happen (assignments, autoscaler decisions)

### Where to add the flag

1. **CLI**: Add `--dry-run` to the `serve` command in `main.py:34`.

2. **ControllerConfig**: Add a `dry_run: bool = False` field at `controller.py:604`.

3. **Controller**: Gate side-effectful operations on `self._config.dry_run`:

   | Method | Location | What to gate |
   |--------|----------|--------------|
   | `_buffer_assignments()` | `controller.py:1359` | Skip `transitions.queue_assignments()`; log assignments instead |
   | `_claim_workers_for_reservations()` | `controller.py:1151` | Skip `transitions.replace_reservation_claims()`; log claims |
   | `_cleanup_stale_claims()` | `controller.py:1119` | Skip DB writes |
   | `kill_tasks_on_workers()` | `controller.py:1392` | Skip `buffer_kill`/`buffer_direct_kill` |
   | `_mark_task_unschedulable()` | `controller.py:1369` | Skip `transitions.mark_task_unschedulable()` |
   | `_sync_all_execution_units()` | `controller.py:1454` | Run heartbeats for probing but skip applying dispatch; or skip entirely |
   | `_sync_direct_provider()` | `controller.py:1005` | Skip `provider.sync()` |
   | `_run_autoscaler_once()` | `controller.py:1555` | Skip `autoscaler.update()` (scale decisions); keep `autoscaler.refresh()` for status |
   | `begin_checkpoint()` | `controller.py:1587` | Skip entirely |
   | `_maybe_prune()` | `controller.py:938` | Skip entirely |

4. **Logging in dry-run mode**: In `_run_scheduling()`, after computing `all_assignments`, log them:
   ```python
   if self._config.dry_run:
       for task_id, worker_id in all_assignments:
           logger.info("[DRY-RUN] Would assign %s -> %s", task_id, worker_id)
       return
   ```
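
One way to avoid repeating `if self._config.dry_run:` across all of the methods in the table is a small guard helper. Everything below is illustrative; only the `dry_run` config field comes from the proposal above:

```python
import logging

logger = logging.getLogger("iris.controller")

def gated(config, label, mutate, *args, **kwargs):
    """In normal mode, run the side-effectful call. In dry-run mode, log
    what would have happened and do nothing."""
    if getattr(config, "dry_run", False):
        logger.info("[DRY-RUN] Would call %s(args=%r, kwargs=%r)",
                    label, args, kwargs)
        return None
    return mutate(*args, **kwargs)
```

A call site such as `_buffer_assignments()` would then reduce to something like `gated(self._config, "queue_assignments", self._transitions.queue_assignments, command)` (attribute names assumed).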

### Alternative: read-only DB wrapper

Instead of scattering `if dry_run` checks, wrap `ControllerTransitions` in a subclass or proxy that logs mutations but doesn't execute them. This is cleaner but requires more upfront work, since `ControllerTransitions` is not designed for composition.
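
A sketch of that wrapper idea, using a proxy rather than a subclass so reads pass through unchanged. The mutator names are the ones identified in this analysis; the real class's full mutating surface would need to be enumerated:

```python
import logging

logger = logging.getLogger("iris.controller")

class DryRunTransitions:
    """Proxy sketch: reads pass through to the wrapped ControllerTransitions,
    mutating methods are logged and dropped."""

    _MUTATORS = frozenset({
        "queue_assignments", "replace_reservation_claims",
        "mark_task_unschedulable", "buffer_kill", "buffer_direct_kill",
    })

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        if name in self._MUTATORS:
            def _log_only(*args, **kwargs):
                logger.info("[DRY-RUN] Suppressed %s(args=%r, kwargs=%r)",
                            name, args, kwargs)
            return _log_only
        return getattr(self._inner, name)  # reads pass through unchanged
```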

### Simplest viable approach

The easiest path with minimal code changes:

1. Add `dry_run: bool` to `ControllerConfig`
2. In `_run_scheduling()` at line 1301: if `dry_run`, log the assignments and return early instead of calling `_buffer_assignments()`
3. In `_claim_workers_for_reservations()`: if `dry_run`, skip the `transitions.replace_reservation_claims()` call
4. In `_run_autoscaler_once()`: skip `autoscaler.update()`
5. In `_sync_all_execution_units()`: skip the `provider.sync()` call (no RPCs to workers)
6. Skip checkpoints and pruning

This keeps the scheduling logic running normally (reads are fine) but suppresses all writes and RPCs. The dashboard + RPC server still runs, so you can inspect state.

### Open questions

1. Should dry-run mode still accept new job submissions via RPC? Probably yes for testing, but the jobs would never be dispatched.
2. Should dry-run mode replay from a checkpoint without probing workers? This would be useful for offline analysis ("what would the controller do with this checkpoint?").
3. Should the autoscaler compute and *log* demand without acting, or skip entirely?