Commit 6e702d8

Update iris debugger skill (#3681)
Update Iris debugging skill.

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Rafal Wojdyla <ravwojdyla@users.noreply.github.com>
1 parent 1de735c commit 6e702d8

1 file changed

+59
-12
lines changed
  • .agents/skills/iris-controller-debug

.agents/skills/iris-controller-debug/SKILL.md

Lines changed: 59 additions & 12 deletions
@@ -1,33 +1,77 @@
11
---
22
name: iris-controller-debug
3-
description: Debug Iris controller state using the live SQLite database. Use when investigating stuck jobs, resource leaks, scheduling failures, or worker issues.
3+
description: Debug Iris controller state using offline checkpoint snapshots and the process RPC. Use when investigating stuck jobs, resource leaks, scheduling failures, or worker issues.
44
---
55

66
# Skill: Iris Controller Debug
77

8-
Debug Iris controller issues by querying the live SQLite database on the controller VM.
8+
Debug Iris controller issues by triggering a fresh checkpoint, downloading it, and querying offline. Use the `/system/process` RPC (via `iris process`) for profiling; SSH is acceptable as a fallback when RPC doesn't cover your needs.
99

1010
Read first: @lib/iris/AGENTS.md
1111

1212
## Access Pattern
1313

14-
The DB is inside a Docker container on a GCP VM. Pipe a Python script through SSH:
14+
**Always debug offline against a checkpoint copy — never run queries against the live controller DB.**
15+
Trigger a fresh checkpoint on-demand and download it:
1516

1617
```bash
17-
cat <<'PYEOF' | gcloud compute ssh <VM> --zone=<ZONE> --project=<PROJECT> \
18-
--command="cat > /tmp/query.py && docker cp /tmp/query.py <CONTAINER>:/tmp/ && docker exec <CONTAINER> python3 /tmp/query.py"
18+
# Trigger a fresh checkpoint (uses the BeginCheckpoint RPC)
19+
# The --config flag selects the cluster; adjust as needed.
20+
uv run iris --config lib/iris/examples/marin.yaml cluster controller checkpoint
21+
# Example output:
22+
# Checkpoint DB written: gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-1773533644027.sqlite3
23+
# Jobs: 417
24+
# Tasks: 46790
25+
# Workers: 243
26+
27+
# Download the checkpoint (use the path from the output above)
28+
gcloud storage cp gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-<EPOCH_MS>.sqlite3 /tmp/controller.sqlite3
29+
30+
# Query offline
31+
python3 -c "
1932
import sqlite3
20-
conn = sqlite3.connect("/tmp/iris/controller-logs/controller.sqlite3")
33+
conn = sqlite3.connect('/tmp/controller.sqlite3')
2134
conn.row_factory = sqlite3.Row
2235
# ... queries ...
23-
PYEOF
36+
"
2437
```
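As a hedged sketch of the "query offline" step, the downloaded copy can be opened through SQLite's read-only URI mode so that an accidental write fails instead of altering the snapshot. The helper names `open_checkpoint_readonly` and `list_tables` are hypothetical and not part of Iris; the table listing relies only on SQLite's built-in `sqlite_master` catalog, so it assumes nothing about the controller schema.

```python
import sqlite3
from pathlib import Path


def open_checkpoint_readonly(path: str) -> sqlite3.Connection:
    """Open a local checkpoint copy in read-only mode (hypothetical helper)."""
    # file:///...?mode=ro makes SQLite reject any write on this connection.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return table names from SQLite's own catalog, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row["name"] for row in rows]
```

For example, `list_tables(open_checkpoint_readonly("/tmp/controller.sqlite3"))` lists the tables in the copy downloaded by the commands above.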
2538

26-
**Find the container ID** with `docker ps` on the VM — look for the iris controller container.
39+
If the state has changed and you need another snapshot, trigger another checkpoint with `uv run iris --config lib/iris/examples/marin.yaml cluster controller checkpoint` and re-download. **Do not SSH into the controller VM to run scripts against the live database** — a slow query can stall the controller and break other users.
2740

2841
**Production defaults** (verify at session start):
29-
- VM: `iris-controller-marin`, zone `us-central1-a`, project `hai-gcp-models`
30-
- DB path: `/tmp/iris/controller-logs/controller.sqlite3`
42+
- Checkpoint location: `gs://marin-us-central2/iris/marin/state/controller-state/latest.sqlite3`
43+
- Timestamped checkpoints: `gs://marin-us-central2/iris/marin/state/controller-state/checkpoint-<epoch_ms>.sqlite3`
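To judge how stale a timestamped checkpoint is, its file name can be decoded into a timestamp. This is a minimal sketch that assumes the `<epoch_ms>` component is Unix epoch milliseconds in UTC; `checkpoint_time` is a hypothetical helper, not an Iris function.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the trailing "checkpoint-<epoch_ms>.sqlite3" part of a path.
_CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)\.sqlite3$")


def checkpoint_time(path: str) -> datetime:
    """Return the UTC time encoded in a timestamped checkpoint path."""
    match = _CHECKPOINT_RE.search(path)
    if match is None:
        # e.g. latest.sqlite3 carries no timestamp in its name
        raise ValueError(f"not a timestamped checkpoint path: {path}")
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer millisecond arithmetic avoids float rounding.
    return epoch + timedelta(milliseconds=int(match.group(1)))
```

Paths such as `latest.sqlite3` raise `ValueError`, which makes it obvious when a path does not carry its own timestamp.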
44+
45+
## Profiling
46+
47+
Prefer the `iris process` CLI, which talks to the controller via the `/system/process` RPC. SSH is acceptable as a fallback if the RPC endpoints don't cover what you need. Typical commands:
48+
49+
```bash
50+
# Thread dump (instant snapshot of all threads)
51+
uv run iris process profile threads
52+
53+
# CPU profile (writes speedscope JSON, 10s default)
54+
uv run iris process profile cpu --duration 10 --output /tmp/profile.speedscope.json
55+
56+
# Memory profile (writes flamegraph HTML)
57+
uv run iris process profile mem --duration 10 --output /tmp/profile.html
58+
59+
# Target a specific worker instead of the controller
60+
uv run iris process profile threads --worker <WORKER_ID>
61+
62+
# Process status (host info, resource usage)
63+
uv run iris process status
64+
```
65+
66+
Controller logs can also be fetched via the CLI:
67+
68+
```bash
69+
# Tail controller logs
70+
uv run iris process logs --max-lines 200
71+
72+
# Filter for slow-path warnings
73+
uv run iris process logs --substring "Slow "
74+
```
3175

3276
## Schema
3377

@@ -79,8 +123,11 @@ These are real issues in the codebase that will mislead you if you don't know ab
79123

80124
3. **Fleet view zone display**: `FleetTab.vue:82` reads `metadata.gceZone` (never populated) instead of `metadata.attributes["zone"]`. The dashboard shows blank zones even when workers have zone attributes.
81125

126+
4. **Heartbeat thread stall on gcloud subprocess**: The heartbeat loop calls `notify_worker_failed` → `scale_down` → `terminate`, which runs a synchronous `gcloud compute tpus tpu-vm delete` (`gcp.py:591`). If the gcloud API hangs, **all task dispatch stops** because dispatches are delivered via heartbeats. Symptoms: `dispatch_queue` growing, tasks stuck in ASSIGNED (9), stale `last_heartbeat_ms` across all workers. The autoscaler thread has the same exposure independently. Check with `py-spy dump` and look for `subprocess.run` called from `terminate` on the heartbeat or autoscaler thread. The stuck gcloud process can be killed to unblock (#3678).
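The symptoms listed for the heartbeat stall can be checked against a downloaded checkpoint with a few counts. The sketch below is illustrative only: it assumes tables named `dispatch_queue`, `tasks` (with a `state` column), and `workers` (with a `last_heartbeat_ms` column), and the 60-second staleness threshold is an arbitrary choice. Verify names against the Schema section before relying on it. `stall_symptoms` is a hypothetical helper.

```python
import sqlite3

ASSIGNED = 9  # task state code quoted in the text above


def stall_symptoms(conn: sqlite3.Connection, now_ms: int,
                   stale_after_ms: int = 60_000) -> dict[str, int]:
    """Count stall indicators in a checkpoint (table/column names are assumed)."""
    cur = conn.cursor()
    queued = cur.execute("SELECT COUNT(*) FROM dispatch_queue").fetchone()[0]
    assigned = cur.execute(
        "SELECT COUNT(*) FROM tasks WHERE state = ?", (ASSIGNED,)
    ).fetchone()[0]
    # A worker is stale if its last heartbeat is older than the cutoff.
    total, stale = cur.execute(
        "SELECT COUNT(*), COALESCE(SUM(last_heartbeat_ms < ?), 0) FROM workers",
        (now_ms - stale_after_ms,),
    ).fetchone()
    return {
        "dispatch_queue": queued,
        "assigned_tasks": assigned,
        "workers": total,
        "stale_workers": stale,
    }
```

Pass the checkpoint's own timestamp as `now_ms` rather than the current wall-clock time; otherwise every worker in an old snapshot looks stale. A result where `stale_workers` equals `workers` while `dispatch_queue` and `assigned_tasks` are large is consistent with the stall described above, though it does not prove it.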
127+
82128
## Rules
83129

84-
- **NEVER modify the database without explicit user approval.** Read-only queries first; writes only as a last resort with user consent. Always run a verification query after any write.
130+
- **NEVER run scripts or queries against the live controller DB.** Always work offline against a downloaded checkpoint. A slow or locking query on the live DB can stall the controller and break other users.
131+
- **Prefer `iris process profile` over SSH for profiling.** It uses the `/system/process` RPC and avoids direct access to the controller VM. SSH is acceptable as a fallback when the RPC endpoints don't cover what you need.
132+
- **NEVER modify the database without explicit user approval.** Read-only queries on the local checkpoint copy only; writes only as a last resort with user consent on a fresh checkpoint.
85133
- **NEVER restart the controller or Docker container** — this kills all running jobs cluster-wide.
86-
- Always verify VM name, zone, and container ID at the start of each session.
