|
1 | 1 | --- |
2 | 2 | name: iris-controller-debug |
3 | | -description: Debug Iris controller state using the live SQLite database. Use when investigating stuck jobs, resource leaks, scheduling failures, or worker issues. |
| 3 | +description: Debug Iris controller state using offline checkpoint snapshots and the process RPC. Use when investigating stuck jobs, resource leaks, scheduling failures, or worker issues. |
4 | 4 | --- |
5 | 5 |
|
6 | 6 | # Skill: Iris Controller Debug |
7 | 7 |
|
8 | | -Debug Iris controller issues by querying the live SQLite database on the controller VM. |
| 8 | +Debug Iris controller issues by triggering a fresh checkpoint, downloading it, and querying offline. Use the `/system/process` RPC (via `iris process`) for profiling; SSH is acceptable as a fallback when RPC doesn't cover your needs. |
9 | 9 |
|
10 | 10 | Read first: @lib/iris/AGENTS.md |
11 | 11 |
|
12 | 12 | ## Access Pattern |
13 | 13 |
|
14 | | -The DB is inside a Docker container on a GCP VM. Pipe a Python script through SSH: |
| 14 | +**Always debug offline against a checkpoint copy — never run queries against the live controller DB.** |
| 15 | +Trigger a fresh checkpoint on-demand and download it: |
15 | 16 |
|
16 | 17 | ```bash |
17 | | -cat <<'PYEOF' | gcloud compute ssh <VM> --zone=<ZONE> --project=<PROJECT> \ |
18 | | - --command="cat > /tmp/query.py && docker cp /tmp/query.py <CONTAINER>:/tmp/ && docker exec <CONTAINER> python3 /tmp/query.py" |
| 18 | +# Trigger a fresh checkpoint (uses the BeginCheckpoint RPC) |
| 19 | +# The --config flag selects the cluster; adjust as needed. |
| 20 | +uv run iris --config lib/iris/examples/marin.yaml cluster controller checkpoint |
| 21 | +# Example output: |
| 22 | +# Checkpoint DB written: gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-1773533644027.sqlite3 |
| 23 | +# Jobs: 417 |
| 24 | +# Tasks: 46790 |
| 25 | +# Workers: 243 |
| 26 | + |
| 27 | +# Download the checkpoint (use the path from the output above) |
| 28 | +gcloud storage cp gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-<EPOCH_MS>.sqlite3 /tmp/controller.sqlite3 |
| 29 | + |
| 30 | +# Query offline |
| 31 | +python3 -c " |
19 | 32 | import sqlite3 |
20 | | -conn = sqlite3.connect("/tmp/iris/controller-logs/controller.sqlite3") |
| 33 | +conn = sqlite3.connect('/tmp/controller.sqlite3') |
21 | 34 | conn.row_factory = sqlite3.Row |
22 | 35 | # ... queries ... |
23 | | -PYEOF |
| 36 | +" |
24 | 37 | ``` |
25 | 38 |
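The offline query step can be sketched as a small Python helper that opens the downloaded copy read-only, so that even a stray write statement fails instead of changing the file. This is a minimal sketch, not part of Iris: `open_checkpoint` and `table_row_counts` are hypothetical names, and the path `/tmp/controller.sqlite3` is simply the download target used above.

```python
import sqlite3
from pathlib import Path


def open_checkpoint(path: str) -> sqlite3.Connection:
    """Open a downloaded checkpoint in SQLite read-only mode (hypothetical helper)."""
    # mode=ro makes SQLite reject writes and fail if the file does not exist.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


def table_row_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the checkpoint."""
    names = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master, so quoting them is sufficient here.
    return {n: conn.execute(f'SELECT COUNT(*) FROM "{n}"').fetchone()[0] for n in names}
```

Calling `table_row_counts(open_checkpoint("/tmp/controller.sqlite3"))` gives a quick overview to compare against the job, task, and worker counts printed by the checkpoint command.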
|
26 | | -**Find the container ID** with `docker ps` on the VM — look for the iris controller container. |
| 39 | +If the state has changed and you need another snapshot, trigger another checkpoint with `uv run iris --config lib/iris/examples/marin.yaml cluster controller checkpoint` and re-download. **Do not SSH into the controller VM to run scripts against the live database** — a slow query can stall the controller and break other users. |
27 | 40 |
|
28 | 41 | **Production defaults** (verify at session start): |
29 | | -- VM: `iris-controller-marin`, zone `us-central1-a`, project `hai-gcp-models` |
30 | | -- DB path: `/tmp/iris/controller-logs/controller.sqlite3` |
| 42 | +- Checkpoint location: `gs://marin-us-central2/iris/marin/state/controller-state/latest.sqlite3` |
| 43 | +- Timestamped checkpoints: `gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-<epoch_ms>.sqlite3` |
| 44 | + |
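Because timestamped checkpoints embed a millisecond epoch in the file name, it can help to convert that value to a readable UTC time before deciding whether a snapshot is fresh enough. A minimal sketch, assuming only the `checkpoint-<epoch_ms>.sqlite3` naming pattern listed above; `checkpoint_time` is a hypothetical helper, not an Iris API.

```python
import re
from datetime import datetime, timezone

# Matches the trailing "checkpoint-<epoch_ms>.sqlite3" component of a path.
_CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)\.sqlite3$")


def checkpoint_time(path: str) -> datetime:
    """Return the UTC time encoded in a timestamped checkpoint path."""
    match = _CHECKPOINT_RE.search(path)
    if match is None:
        # e.g. latest.sqlite3 carries no timestamp in its name
        raise ValueError(f"not a timestamped checkpoint path: {path}")
    epoch_ms = int(match.group(1))
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
```

The same value is a reasonable "now" reference when judging how stale a heartbeat was at the moment the snapshot was taken, since wall-clock time keeps moving after the download.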
| 45 | +## Profiling |
| 46 | + |
| 47 | +Prefer the `iris process` CLI, which talks to the controller via the `/system/process` RPC. SSH is acceptable as a fallback only when the RPC endpoints don't cover what you need. Common commands: |
| 48 | + |
| 49 | +```bash |
| 50 | +# Thread dump (instant snapshot of all threads) |
| 51 | +uv run iris process profile threads |
| 52 | + |
| 53 | +# CPU profile (writes speedscope JSON, 10s default) |
| 54 | +uv run iris process profile cpu --duration 10 --output /tmp/profile.speedscope.json |
| 55 | + |
| 56 | +# Memory profile (writes flamegraph HTML) |
| 57 | +uv run iris process profile mem --duration 10 --output /tmp/profile.html |
| 58 | + |
| 59 | +# Target a specific worker instead of the controller |
| 60 | +uv run iris process profile threads --worker <WORKER_ID> |
| 61 | + |
| 62 | +# Process status (host info, resource usage) |
| 63 | +uv run iris process status |
| 64 | +``` |
| 65 | + |
| 66 | +Controller logs can also be fetched via the CLI: |
| 67 | + |
| 68 | +```bash |
| 69 | +# Tail controller logs |
| 70 | +uv run iris process logs --max-lines 200 |
| 71 | + |
| 72 | +# Filter for slow-path warnings |
| 73 | +uv run iris process logs --substring "Slow " |
| 74 | +``` |
31 | 75 |
|
32 | 76 | ## Schema |
33 | 77 |
|
@@ -79,8 +123,11 @@ These are real issues in the codebase that will mislead you if you don't know ab |
79 | 123 |
|
80 | 124 | 3. **Fleet view zone display**: `FleetTab.vue:82` reads `metadata.gceZone` (never populated) instead of `metadata.attributes["zone"]`. The dashboard shows blank zones even when workers have zone attributes. |
81 | 125 |
|
| 126 | +4. **Heartbeat thread stall on gcloud subprocess**: The heartbeat loop calls `notify_worker_failed` → `scale_down` → `terminate`, which runs a synchronous `gcloud compute tpus tpu-vm delete` (`gcp.py:591`). If the gcloud API hangs, **all task dispatch stops** because dispatches are delivered via heartbeats. Symptoms: `dispatch_queue` growing, tasks stuck in ASSIGNED (9), stale `last_heartbeat_ms` across all workers. The autoscaler thread independently has the same exposure. Check with `py-spy dump` and look for `subprocess.run` → `terminate` on the heartbeat or autoscaler thread. Killing the stuck gcloud process unblocks the affected thread (#3678). |
| 127 | + |
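The three symptoms in the heartbeat-stall gotcha above can be checked together against a downloaded checkpoint. The sketch below is illustrative only: `dispatch_queue`, `last_heartbeat_ms`, and the ASSIGNED state code 9 come from the text, but the table names `tasks` and `workers`, the column name `state`, and the 60-second staleness threshold are assumptions to verify against the Schema section before use.

```python
import sqlite3

ASSIGNED = 9  # task state code quoted in the gotcha above


def heartbeat_stall_signals(conn, now_ms, stale_after_ms=60_000):
    """Summarize stall symptoms from a checkpoint (schema names are assumed)."""
    assigned = conn.execute(
        "SELECT COUNT(*) FROM tasks WHERE state = ?", (ASSIGNED,)
    ).fetchone()[0]
    queued = conn.execute("SELECT COUNT(*) FROM dispatch_queue").fetchone()[0]
    # SUM over a boolean expression counts the rows where it is true.
    total, stale = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(last_heartbeat_ms < ?), 0) FROM workers",
        (now_ms - stale_after_ms,),
    ).fetchone()
    return {
        "assigned_tasks": assigned,
        "dispatch_queue_rows": queued,
        "workers": total,
        "stale_workers": stale,
        # A stall affects every worker at once, not just a few.
        "all_workers_stale": total > 0 and stale == total,
    }
```

Pass the checkpoint's own timestamp as `now_ms` rather than the current wall-clock time; otherwise every worker in an old snapshot looks stale. A single snapshot cannot show `dispatch_queue` growing, so comparing two checkpoints taken a few minutes apart is the more telling check.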
82 | 128 | ## Rules |
83 | 129 |
|
84 | | -- **NEVER modify the database without explicit user approval.** Read-only queries first; writes only as a last resort with user consent. Always run a verification query after any write. |
| 130 | +- **NEVER run scripts or queries against the live controller DB.** Always work offline against a downloaded checkpoint. A slow or locking query on the live DB can stall the controller and break other users. |
| 131 | +- **Prefer `iris process profile` over SSH for profiling.** It uses the `/system/process` RPC and avoids direct access to the controller VM. SSH is acceptable as a fallback when the RPC endpoints don't cover what you need. |
| 132 | +- **NEVER modify the database without explicit user approval.** Run read-only queries, and only against the local checkpoint copy; writes are a last resort, made with user consent and on a fresh checkpoint. |
85 | 133 | - **NEVER restart the controller or Docker container** — this kills all running jobs cluster-wide. |
86 | | -- Always verify VM name, zone, and container ID at the start of each session. |
|