# Halve Ray cluster minimums, boost Iris capacity (#4175)
> Analyze controller startup flow, scheduling loop, worker sync, autoscaler, and checkpoint systems. Identify all side-effectful operations that need gating for a `--dry-run` flag.
Shift capacity from Ray to Iris to accelerate migration:

- Ray `min_workers` halved across all clusters (training + vllm + staging).
- Freed capacity given to Iris as `min_slices`:
  - `tpu_v5p_8`: 0 → 8 (absorbed from us-central1 + us-east5-a)
  - `tpu_v4_8`: 0 → 2 (absorbed from us-central2)
  - `tpu_v5e_4`: 1 → 3 (absorbed from eu-west4)
  - `tpu_v5e_128`: 0 → 1 (absorbed from eu-west4)
  - `tpu_v6e_128`: 0 → 1 (absorbed from eu-west4-a)
- Iris controller VM upgraded from `e2-standard-4` (16GB) to `e2-highmem-4` (32GB) in both prod and dev configs to address memory pressure.
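As a concrete illustration, the halving for one pool would look roughly like this in the Ray cluster YAML (standard Ray `available_node_types` schema; the value shown is the us-central1 `tpu_slice_v5p_8` change from the table below, all other keys elided):

```yaml
# infra/marin-us-central1.yaml -- sketch, only the changed key shown
available_node_types:
  tpu_slice_v5p_8:
    min_workers: 6   # halved from 12; the freed v5p-8 capacity goes to Iris
```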
**Review**: Claude finished @rjpower's task in 1m 28s.

Findings: No bugs or correctness issues in the config changes themselves.
On the added file "Iris Controller Dry-Run Mode: Codebase Analysis" (new, +177 lines):
This research document appears to be from a separate task and shouldn't be committed to the repo root. Consider removing it from this PR (it was added in commit 1a6bc2e).
- Halved `min_workers` across all Ray clusters (12 files: training, vllm, staging) to free TPU capacity
- Transferred freed capacity to Iris `min_slices` in `marin.yaml`: v5p-8 +8, v4-8 +2, v5e-4 +2, v5e-128 +1, v6e-128 +1
- Upgraded Iris controller VM from `e2-standard-4` (16GB) to `e2-highmem-4` (32GB) in prod and dev

## Ray min_workers changes

| Cluster | Node type | Before | After |
|---|---|---:|---:|
| us-central1 | tpu_worker (v5p-8) | 1 | 0 |
| us-central1 | tpu_slice_v5p_8 | 12 | 6 |
| us-central1 | tpu_slice_v5p_16 | 1 | 0 |
| us-central1 | tpu_slice_v5p_32 | 1 | 0 |
| us-central1 | tpu_slice_v5p_64 | 1 | 0 |
| us-central2 | tpu_worker (v4-8) | 4 | 2 |
| us-central2-staging | tpu_worker (v4-8) | 4 | 2 |
| eu-west4 | tpu_worker (v5e-4) | 4 | 2 |
| eu-west4 | tpu_slice_v5e_128 | 1 | 0 |
| eu-west4-a | tpu_slice_v6e_128 | 2 | 1 |
| us-east5-a | tpu_worker (v5p-8) | 8 | 4 |
| us-east5-a | tpu_slice_v5p_8 | 8 | 4 |
| us-east5-a-vllm | tpu_worker | 1 | 0 |
| us-east5-a-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-east5-b-vllm | tpu_worker | 2 | 1 |
| eu-west4-vllm | tpu_worker | 2 | 1 |
| us-central1-vllm | tpu_worker | 1 | 0 |
| us-central1-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-central2-vllm | tpu_worker | 2 | 1 |
| us-east1-d-vllm | tpu_worker | 2 | 1 |
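On the Iris side, the `min_slices` bumps might look roughly like this in `lib/iris/examples/marin.yaml`. The slice-type names and `min_slices` values are taken from the PR description; the surrounding nesting (a `scaling_groups` map) is purely an assumption about Iris's config schema:

```yaml
# lib/iris/examples/marin.yaml -- hypothetical shape; only min_slices and the
# slice-type names come from the PR, the nesting is assumed
scaling_groups:
  tpu_v5p_8:
    min_slices: 8   # was 0; absorbed from us-central1 + us-east5-a
  tpu_v4_8:
    min_slices: 2   # was 0; absorbed from us-central2
  tpu_v5e_4:
    min_slices: 3   # was 1; absorbed from eu-west4
```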
## Follow-up PR: cap `max_workers`, drop `min_workers` to 0
## Summary

Caps `max_workers` on Ray cluster TPU pools and drops `min_workers` to 0; Iris is the warm pool now. Ray currently serves ~11% of Marin's TPU fleet (39 of 339 READY nodes; Iris has 271). With Iris handling the bulk of workloads, Ray no longer needs standing warm capacity, and its `max_workers: 1024` ceilings make little sense on pools that peaked at single-digit concurrent workers.

Follow-up to #4175, which halved `min_workers` on the same files. This PR continues the migration push by setting:

- **`min_workers: 0`** on every pool that still had a nonzero floor.
- **Single-host TPU pools** (v4-8, v5e-4, v5p-8, v6e-4, v6e-8): **`max_workers: 4`**, enough to run a small job, not enough to host serious training.
- **`tpu_slice_v5p_64` in `us-central1`**: `max_workers: 2` (4 observed).
- **`tpu_slice_v6e_128` in `eu-west4-a`**: `max_workers: 1` (1 observed).
- **Shared-accelerator pools** capped together: `us-central1` and `us-east5-a` each have two v5p-8 pools (`tpu_worker` + `tpu_slice_v5p_8`); capping one without the other would be a loophole.
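A sketch of what one capped pool ends up looking like, using standard Ray `available_node_types` keys and the `us-east5-a` v5p-8 values from the changes table (all other config keys elided):

```yaml
# infra/marin-us-east5-a.yaml -- sketch; both v5p-8 pools capped together,
# since capping only one would leave a loophole on the shared accelerator
available_node_types:
  tpu_worker:
    min_workers: 0   # was 4; Iris is the warm pool now
    max_workers: 4   # was 1024
  tpu_slice_v5p_8:
    min_workers: 0   # was 4
    max_workers: 4   # was 1024
```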
## Changes (15 edits across 12 files)

| File | Pool | Accel | `min_workers` | `max_workers` |
|---|---|---|---:|---:|
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_64` | v5p-64 | 0 *(unchanged)* | **2** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 6)* | **4** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_worker` | v5p-8 | 0 *(unchanged)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_worker` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-central2.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-us-central2-staging.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4-a.yaml` | `tpu_slice_v6e_128` | v6e-128 | **0** *(was 1)* | **1** *(was 1024)* |
| `infra/marin-eu-west4-vllm.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central1-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central2-vllm.yaml` | `tpu_worker` | v4-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east1-d-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-b-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |

Observed counts come from `gcloud asset search-all-resources --asset-types=tpu.googleapis.com/Node --query='labels.ray-cluster-name:*'` against `hai-gcp-models`, grouped by `(ray-cluster-name, ray-user-node-type, state)`.

## Expected impact

- **`marin-us-central1`** is at ~95% v5p TPU utilization right now.
After deploy, the autoscaler will block new scale-ups above the cap; existing workers finish their current tasks and are not forcibly terminated, but cannot be replaced once the cap is reached. Jobs will lose headroom over the next several hours rather than crashing immediately. **Please post in #infra Discord before deploying this one.**
- All other live clusters are warm-idle, draining, or head-only; deploy impact is ~0.
- Dead clusters are self-applying: the YAML change takes effect the next time anyone runs `ray up` against them. Since they have no live head node, there is nothing to push.

## Deploy

After merge, per `infra/README.md`:

```bash
# Live clusters: apply one at a time, verify autoscaler logs after each
uv run ray up -y infra/marin-us-east5-a.yaml
uv run ray up -y infra/marin-eu-west4.yaml
uv run ray up -y infra/marin-eu-west4-a.yaml
uv run ray up -y infra/marin-us-central2.yaml
uv run ray up -y infra/marin-us-central2-staging.yaml
uv run ray up -y infra/marin-us-east1-d-vllm.yaml
uv run ray up -y infra/marin-us-central1.yaml  # post in #infra Discord first

# Autoscaler log tail for verification:
# ray exec infra/marin-<cluster>.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"
```

Dead clusters (`*-vllm`, `us-east1`) need no deploy; there is no live head node to push to.

## Not in this PR (deliberate)

- **`infra/marin-big-run.yaml`**: 17 worker nodes in us-central2-b, uses a non-standard naming convention without the `ray-cluster-name` label. Needs structural review. Follow-up.
- **`infra/marin-us-east5.yaml`**: currently draining (5 READY + 7 DELETING). Let it settle; revisit after.
- **`infra/marin-us-west4.yaml`**: `ray status` showed 2 untagged workers; ambiguous in Cloud Asset Inventory. Needs re-query with a different filter. Follow-up.
- **Iris `buffer_slices` bumps** in `lib/iris/examples/marin.yaml`: not needed to absorb the capacity (Iris `max_slices` are generously sized), but may be worth a reactive bump if migration friction surfaces.

## Test plan

- [ ] CI: YAML parses cleanly (repo-level pre-commit YAML check passed locally).
- [ ] Dry-run deploy on one dead cluster (e.g. `marin-eu-west4-vllm.yaml`) to confirm the new caps are accepted by Ray's config parser. Nothing is running there, so no risk.
- [ ] Deploy live non-busy clusters first (us-east5-a, eu-west4, eu-west4-a, us-central2, us-central2-staging, us-east1-d-vllm). Tail autoscaler logs; confirm no new scale-ups above cap.
- [ ] Post in #infra Discord, then deploy `marin-us-central1`. Monitor for job failures over the next few hours.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
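The observed counts behind the changes table (and any re-query for `us-west4`) come from grouping inventory rows by `(ray-cluster-name, ray-user-node-type, state)`. That grouping step can be sketched as a small pipeline; the `gcloud` invocation quoted from above is left as a comment so the snippet stays self-contained, and sample lines stand in for its output (extracting the three fields from the real JSON response is omitted):

```shell
# Group inventory rows by (ray-cluster-name, ray-user-node-type, state).
# Real input would come from the query cited above, roughly:
#   gcloud asset search-all-resources \
#     --asset-types=tpu.googleapis.com/Node \
#     --query='labels.ray-cluster-name:*'
# against project hai-gcp-models.
group_counts() {
  sort | uniq -c | sort -rn
}

# Sample rows showing the output shape: count, cluster, node type, state.
printf '%s\n' \
  'marin-us-central1 tpu_slice_v5p_8 READY' \
  'marin-us-central1 tpu_slice_v5p_8 READY' \
  'marin-eu-west4 tpu_worker READY' \
| group_counts
```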