All subcommands have --help. Use it.
Two connection modes: --config=PATH (auto-tunnels) or --controller-url=URL (manual tunnel).
iris cluster start|stop|restart|status
iris cluster dashboard # open tunnel, print URL, block
iris cluster dashboard-proxy # local proxy to remote controller (no tunnel needed)iris cluster controller restart restarts the controller only (seconds of downtime, workers unaffected).
iris cluster restart tears down everything — controller + all workers. All jobs die. Never run the full iris cluster restart without explicit user approval.
Workflow: dry-run locally (iris cluster controller serve --dry-run) -> capture baseline (iris cluster status) -> restart -> verify.
If checkpoint times out: iris cluster controller restart --skip-checkpoint (restores from last periodic checkpoint; some recent state may be lost).
iris job run -- python train.py # submit + stream logs
iris job list --state running # filter by state
iris job logs /user/job-name -f # follow logs
iris job stop /user/job-name # kill job + children
iris job summary /user/job-name # per-task state, exit, duration, peak memory
iris job summary /user/job-name --json # same, machine-readable
iris job bug-report /user/job-name # structured diagnostic dump--memorynot--ram— unrecognized flags silently pass through to the command string.-e KEY VALUEuses two positional args. If$VALUEis unset, the parser eats the next token. Always quote:-e KEY "${VALUE}".--extra gpuinstalls CUDA jaxlib but does NOT request GPU hardware. Need both--gpu H100x8 --extra gpu.--reserveholds capacity for scheduling only — does not attach accelerator devices. Use--tpu/--gpuon the task that needs hardware.executor_mainparent jobs (e.g., canary ferries) submit GPU sub-tasks via Fray. The parent must be CPU-only (--cpu 1 --memory 16g), otherwise it hogs the GPU node and deadlocks.
iris task exec /user/job/0 -- bash # shell into running container
iris task exec /user/job/0 -- python -c "import jax; print(jax.devices())"Default timeout is 60s. Use --timeout 300 for slow commands, --timeout -1 for no timeout (last resort).
The exec session is non-interactive and buffers output. To run a command that survives disconnect, wrap with nohup + &:
iris task exec /user/job/0 -- bash -c "nohup bash -c 'your-command > /tmp/out.log 2>&1' &"
iris task exec /user/job/0 -- cat /tmp/out.log # check lateriris process status # controller resource usage
iris process status -t /system/worker/<id> # worker process status
iris process logs -f # follow controller logs
iris process logs --level WARNING # filter by level
iris process profile threads # thread dump (prints to stdout)
iris process profile cpu -d 10 # 10s CPU profile (writes .speedscope.json)
iris process profile mem # memory flamegraph (writes .html)
iris process profile cpu -t /user/job/0 # profile a running task containerPrefer iris process profile over SSH for profiling — it uses the /system/process RPC and avoids direct VM access. SSH is a fallback only when the RPC doesn't cover your needs.
iris rpc controller get-scheduler-state # pending queue, resource constraints, priority bands
iris rpc controller get-autoscaler-status # per-group demand, backoff, failures, quota
iris rpc controller get-provider-status # scheduling events, cluster capacity
iris cluster vm status # scale groups with slice countsPriority bands: PRIORITY_BAND_INTERACTIVE (default), PRIORITY_BAND_PRODUCTION (can preempt interactive), PRIORITY_BAND_BATCH (preemptible). See docs/priority-bands.md for the user-facing guide on when to pick each band.
The controller exposes its SQLite DB via RPC:
iris query "SELECT state, count(*) FROM jobs GROUP BY state"
iris query "SELECT state, count(*) FROM tasks GROUP BY state" -f jsonNever modify the controller database without explicit user approval — read-only queries only, even on offline checkpoints.
State codes: 1=PENDING, 2=BUILDING, 3=RUNNING, 4=SUCCEEDED, 5=FAILED, 6=KILLED, 7=WORKER_FAILED, 8=UNSCHEDULABLE, 9=ASSIGNED (tasks only), 10=PREEMPTED (tasks only).
- Active states: 2 (BUILDING), 3 (RUNNING), and 9 (ASSIGNED) — not just RUNNING. Forgetting ASSIGNED causes resource attribution misdiagnosis.
- Committed resources:
workershascommitted_cpu_millicores,committed_mem_bytes, etc. Total capacity is inmetadata_proto(serialized protobuf). Available = capacity - committed. request_proto: serialized protobuf injobs.request_proto. You need protobuf to decode — plain SQL cannot inspect task constraints.
-- Failed jobs with errors
SELECT job_id, error, exit_code FROM jobs WHERE state=5 ORDER BY submitted_at_ms DESC LIMIT 10;
-- Quota-blocked scale groups
SELECT name, consecutive_failures, quota_reason FROM scaling_groups
WHERE consecutive_failures > 0 OR quota_reason != '';
-- Active slices (GCP)
SELECT slice_id, lifecycle, scale_group, worker_ids FROM slices WHERE lifecycle='ready';
-- Task attempt history (debugging retries)
SELECT task_id, attempt_id, state, exit_code, error FROM task_attempts
WHERE task_id LIKE '%<job_fragment>%' ORDER BY attempt_id;
-- What the controller has been doing
SELECT kind, count(*) FROM txn_log GROUP BY kind ORDER BY count(*) DESC LIMIT 10;Full table list: iris query "SELECT name FROM sqlite_master WHERE type='table'".
For slow queries, trigger a checkpoint and query offline. Never run expensive queries against the live DB — they stall the controller.
iris cluster controller checkpoint # trigger fresh checkpoint
# Download the checkpoint file (path printed by command above)
sqlite3 /tmp/controller.sqlite3 "SELECT ..."iris login # authenticate, store JWT locally
iris rpc controller list-users # active users with task/job counts
iris user budget list # per-user budget limits
iris key create --name ci-bot # create API key
iris key list / iris key revoke # manage API keys| Symptom | Diagnostic |
|---|---|
| Job stuck PENDING | iris rpc controller get-scheduler-state for constraints. Check quota: iris query "SELECT name, consecutive_failures, quota_reason FROM scaling_groups WHERE quota_reason != ''" |
| Workers not joining (GCP) | iris cluster vm status for slice lifecycle. SSH to VM, check bootstrap logs. |
| Autoscaler not scaling | iris rpc controller get-autoscaler-status — check backoff_until_ms, consecutive_failures. |
| Task retrying | iris job bug-report /user/job — full attempt history with per-attempt errors. |
| Task failed with exit 137 / suspected OOM | iris job summary /user/job — per-task peak memory + exit code. If most shards peak near the container memory limit, raise --memory on resubmit. |
| Dashboard unreachable | Verify tunnel is alive. curl -sf http://localhost:10000/health. |
-
Committed resource leak (
transitions.py):_decommit_worker_resources()can miss certain task termination paths, leaving stale committed resources on workers. Symptom: workers show high committed CPU/memory/TPU with zero active tasks. Detect by joiningworkersagainst active tasks intask_attempts. -
Heartbeat thread stall on gcloud subprocess (#3678): The heartbeat loop calls
notify_worker_failed->scale_down->terminatewhich runs a synchronousgcloud compute tpus tpu-vm delete. If the gcloud API hangs, all task dispatch stops because dispatches are delivered via heartbeats. Symptoms:dispatch_queuegrowing, tasks stuck in ASSIGNED (9), stalelast_heartbeat_ms. Diagnose withpy-spy dump— look forsubprocess.run->terminateon the heartbeat thread. Kill the stuck gcloud process to unblock.
# SSH tunnel (IAP)
gcloud compute ssh iris-controller-marin --zone=us-central1-a \
--project=hai-gcp-models --tunnel-through-iap -- -L 10000:localhost:10000 -N
# Then: iris --controller-url=http://localhost:10000 ...
# Or config-based auto-tunnel: iris --config=lib/iris/examples/marin.yaml ...Configs: marin.yaml (production), marin-dev.yaml (dev, smaller scale caps).
# Controller VM
gcloud compute instances list --project=hai-gcp-models \
--filter="labels.iris-marin-controller=true" --format="table(name,zone,status)"
# Iris-managed worker VMs
gcloud compute instances list --project=hai-gcp-models \
--filter="labels.iris-marin-managed=true" --format="table(name,zone,status)"
# TPU VMs (all zones)
gcloud compute tpus tpu-vm list --project=hai-gcp-models --zone=- \
--format="table(name,zone,state,acceleratorType)" | head -30Trigger patterns (bad node, not a code bug):
RuntimeError: No accelerator found. Please run on a TPU or GPU.FAILED_PRECONDITIONDevice or resource busy
Recovery: extract worker IP from logs -> map to VM name (gcloud compute tpus tpu-vm list --zone <ZONE> --format="table(name,networkEndpoints[0].ipAddress)") -> delete bad node (gcloud compute tpus tpu-vm delete <NAME> --zone <ZONE> --quiet) -> resubmit job.
Only delete the specific bad node. If multiple nodes fail simultaneously or the same node fails again, escalate to the user.
State dir: gs://marin-us-central2/iris/<cluster>/state/ — contains bundles/ (code packages), controller-state/ (SQLite checkpoints), logs/ (Parquet).
- Quota is the primary scaling bottleneck. The autoscaler backs off exponentially per scale group. Check with
iris rpc controller get-autoscaler-status. - Stuck TPU VMs. Occasionally a TPU VM gets stuck in DELETING for days. Check:
gcloud compute tpus tpu-vm list --project=hai-gcp-models --zone=- --filter="state=DELETING". - Reservation system. Accelerator jobs create
:reservation:sub-jobs that hold slices. View withiris query "SELECT * FROM reservation_claims".
# Port-forward (keep terminal open)
kubectl --kubeconfig ~/.kube/coreweave-iris \
port-forward -n iris svc/iris-controller-svc 10000:10000
# Then: iris --controller-url=http://localhost:10000 ...
# Or config-based: iris --config=lib/iris/examples/coreweave.yaml ...Namespaces: iris (main dev), iris-ci (persistent CI), iris-canary (GPU canary, ephemeral).
Configs: coreweave.yaml, coreweave-ci.yaml, coreweave-canary.yaml.
On CW, there are no persistent worker daemons. The controller dispatches tasks directly as K8s pods. list-workers returns empty. The workers SQL table is empty. Use iris rpc controller get-kubernetes-cluster-status for pod/node status.
kci get pods -n iris -l iris.managed=true # task pods
kci get nodepools # all nodepools (cluster-scoped)
kci get events -n iris --sort-by=.lastTimestamp | tail -30
kci logs -n iris deployment/iris-controller -f # controller logs(kci = kubectl --kubeconfig ~/.kube/coreweave-iris)
# Check status (columns: TARGET, QUEUED, INPROGRESS, CURRENT, CAPACITY, QUOTA)
kci get nodepools
# Scale — do NOT use `kubectl scale --replicas`, that's the wrong field
kci patch nodepool <name> --type=merge -p '{"spec":{"targetNodes":N}}'
# Delete
kci delete nodepool <name>Stuck deletion (autoscaler fights deletion or node mid-delivery):
kci scale deployment iris-controller -n iris --replicas=0 # stop autoscaler
kci patch nodepool <name> --type=merge -p '{"spec":{"autoscaling":false,"targetNodes":0}}'
# If still stuck (mid-delivery): remove finalizer
kci patch nodepool <name> --type=json -p '[{"op":"remove","path":"/metadata/finalizers"}]'
kci delete nodepool <name>iris cluster stop deletes pods but NodePools survive (scale to zero via CW autoscaler). To avoid lingering GPU costs:
iris cluster stop
kci delete nodepool -l iris-<label_prefix>-managed=true- NodePools survive
cluster stop. Delete explicitly to avoid lingering GPU costs. list-workersreturns empty. KubernetesProvider dispatches pods directly — no persistent workers. Useiris rpc controller get-kubernetes-cluster-status.list-tasksrequiresjob_id. Calling without it throwsConnectError: job_id is required.cluster startalways rebuilds+pushes images. Needsdocker login ghcr.iowithwrite:packagesPAT.- Konnectivity agent.
kubectl port-forwardreturns 500 untilkonnectivity-agentpods are running (~18-30s after node provisions). kubectl scale --replicasis wrong for NodePools. Usekci patch nodepool ... '{"spec":{"targetNodes":N}}'.
- Check account-wide H100 contention, not just
iris-canary:kci get nodepools -A. Ifiris-ci-h100-8x(or any other pool) is already holding the zone's H100 quota atmaxNodes=1, the canary'siris-canary-h100-8xcannot scale up — CW account caps total H100 in US-WEST-04A. - Workaround: reuse the CI nodepool (point
coreweave-canary.yamlh100-8xselector atiris-iris-ci-managed=trueand coordinate with iris-ci) or scaleiris-ci-h100-8xto 0 before the canary runs. - Root fix: CW support ticket to raise
gd-8xh100ib-i128account quota ≥2 in US-WEST-04A.
| Workflow | Trigger | What |
|---|---|---|
marin-canary-ferry.yaml |
Daily 6AM UTC | TPU canary on GCP (marin-dev.yaml) |
marin-canary-ferry-cw.yaml |
Daily 10AM UTC | GPU canary on CW — shares iris-ci controller + H100 nodepool with iris-coreweave-ci.yaml (concurrency group iris-coreweave-ci-shared) |
iris-cloud-smoke-gcp.yaml |
PRs touching lib/iris/ |
GCP smoke test (ephemeral cluster) |
iris-coreweave-ci.yaml |
PRs touching lib/iris/ |
CW integration tests (warm cluster) |
# Trigger manually
gh workflow run "<workflow name>" -R marin-community/marin --ref main
# View failed run
gh run view <run-id> -R marin-community/marin --log-failed | tail -50| Resource | Time |
|---|---|
| CW CPU node | ~14 min |
| CW H100 bare-metal | ~20 min |
| CW first training step (from zero) | ~25-30 min |