# Halve Ray cluster minimums, boost Iris capacity (#4175)
> Analyze controller startup flow, scheduling loop, worker sync, autoscaler, and checkpoint systems. Identify all side-effectful operations that need gating for a `--dry-run` flag.
Shift capacity from Ray to Iris to accelerate migration:

- Ray `min_workers` halved across all clusters (training + vllm + staging).
- Freed capacity given to Iris as `min_slices`:
  - `tpu_v5p_8`: 0 → 8 (absorbed from us-central1 + us-east5-a)
  - `tpu_v4_8`: 0 → 2 (absorbed from us-central2)
  - `tpu_v5e_4`: 1 → 3 (absorbed from eu-west4)
  - `tpu_v5e_128`: 0 → 1 (absorbed from eu-west4)
  - `tpu_v6e_128`: 0 → 1 (absorbed from eu-west4-a)
- Iris controller VM upgraded from `e2-standard-4` (16GB) to `e2-highmem-4` (32GB) in both prod and dev configs to address memory pressure.
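As a concrete illustration, the halving for one pool would look roughly like this in the Ray cluster YAML (standard Ray `available_node_types` schema; the value shown is the us-central1 `tpu_slice_v5p_8` change from the table below, all other keys elided):

```yaml
# infra/marin-us-central1.yaml -- sketch, only the changed key shown
available_node_types:
  tpu_slice_v5p_8:
    min_workers: 6   # halved from 12; the freed v5p-8 capacity goes to Iris
```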
**Review**: Claude finished @rjpower's task in 1m 28s.

Findings: No bugs or correctness issues in the config changes themselves.
On the added file "Iris Controller Dry-Run Mode: Codebase Analysis" (new, +177 lines):
This research document appears to be from a separate task and shouldn't be committed to the repo root. Consider removing it from this PR (it was added in commit 1a6bc2e).
- Halved `min_workers` across all Ray clusters (12 files: training, vllm, staging) to free TPU capacity
- Transferred freed capacity to Iris `min_slices` in `marin.yaml`: v5p-8 +8, v4-8 +2, v5e-4 +2, v5e-128 +1, v6e-128 +1
- Upgraded Iris controller VM from `e2-standard-4` (16GB) to `e2-highmem-4` (32GB) in prod and dev

## Ray min_workers changes

| Cluster | Node type | Before | After |
|---|---|---:|---:|
| us-central1 | tpu_worker (v5p-8) | 1 | 0 |
| us-central1 | tpu_slice_v5p_8 | 12 | 6 |
| us-central1 | tpu_slice_v5p_16 | 1 | 0 |
| us-central1 | tpu_slice_v5p_32 | 1 | 0 |
| us-central1 | tpu_slice_v5p_64 | 1 | 0 |
| us-central2 | tpu_worker (v4-8) | 4 | 2 |
| us-central2-staging | tpu_worker (v4-8) | 4 | 2 |
| eu-west4 | tpu_worker (v5e-4) | 4 | 2 |
| eu-west4 | tpu_slice_v5e_128 | 1 | 0 |
| eu-west4-a | tpu_slice_v6e_128 | 2 | 1 |
| us-east5-a | tpu_worker (v5p-8) | 8 | 4 |
| us-east5-a | tpu_slice_v5p_8 | 8 | 4 |
| us-east5-a-vllm | tpu_worker | 1 | 0 |
| us-east5-a-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-east5-b-vllm | tpu_worker | 2 | 1 |
| eu-west4-vllm | tpu_worker | 2 | 1 |
| us-central1-vllm | tpu_worker | 1 | 0 |
| us-central1-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-central2-vllm | tpu_worker | 2 | 1 |
| us-east1-d-vllm | tpu_worker | 2 | 1 |
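On the Iris side, the `min_slices` bumps might look roughly like this in `lib/iris/examples/marin.yaml`. The slice-type names and `min_slices` values are taken from the PR description; the surrounding nesting (a `scaling_groups` map) is purely an assumption about Iris's config schema:

```yaml
# lib/iris/examples/marin.yaml -- hypothetical shape; only min_slices and the
# slice-type names come from the PR, the nesting is assumed
scaling_groups:
  tpu_v5p_8:
    min_slices: 8   # was 0; absorbed from us-central1 + us-east5-a
  tpu_v4_8:
    min_slices: 2   # was 0; absorbed from us-central2
  tpu_v5e_4:
    min_slices: 3   # was 1; absorbed from eu-west4
```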
## Follow-up PR: cap `max_workers`, drop `min_workers` to 0
## Summary

Caps `max_workers` on Ray cluster TPU pools and drops `min_workers` to 0; Iris is the warm pool now. Ray currently serves ~11% of Marin's TPU fleet (39 of 339 READY nodes; Iris has 271). With Iris handling the bulk of workloads, Ray no longer needs standing warm capacity, and its `max_workers: 1024` ceilings make little sense on pools that peaked at single-digit concurrent workers.

Follow-up to #4175, which halved `min_workers` on the same files. This PR continues the migration push by setting:

- **`min_workers: 0`** on every pool that still had a nonzero floor.
- **Single-host TPU pools** (v4-8, v5e-4, v5p-8, v6e-4, v6e-8): **`max_workers: 4`**, enough to run a small job, not enough to host serious training.
- **`tpu_slice_v5p_64` in `us-central1`**: `max_workers: 2` (4 observed).
- **`tpu_slice_v6e_128` in `eu-west4-a`**: `max_workers: 1` (1 observed).
- **Shared-accelerator pools** capped together: `us-central1` and `us-east5-a` each have two v5p-8 pools (`tpu_worker` + `tpu_slice_v5p_8`); capping one without the other would be a loophole.
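A sketch of what one capped pool ends up looking like, using standard Ray `available_node_types` keys and the `us-east5-a` v5p-8 values from the changes table (all other config keys elided):

```yaml
# infra/marin-us-east5-a.yaml -- sketch; both v5p-8 pools capped together,
# since capping only one would leave a loophole on the shared accelerator
available_node_types:
  tpu_worker:
    min_workers: 0   # was 4; Iris is the warm pool now
    max_workers: 4   # was 1024
  tpu_slice_v5p_8:
    min_workers: 0   # was 4
    max_workers: 4   # was 1024
```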
## Changes (15 edits across 12 files)

| File | Pool | Accel | `min_workers` | `max_workers` |
|---|---|---|---:|---:|
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_64` | v5p-64 | 0 *(unchanged)* | **2** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 6)* | **4** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_worker` | v5p-8 | 0 *(unchanged)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_worker` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-central2.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-us-central2-staging.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4-a.yaml` | `tpu_slice_v6e_128` | v6e-128 | **0** *(was 1)* | **1** *(was 1024)* |
| `infra/marin-eu-west4-vllm.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central1-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central2-vllm.yaml` | `tpu_worker` | v4-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east1-d-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-b-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |

Observed counts come from `gcloud asset search-all-resources --asset-types=tpu.googleapis.com/Node --query='labels.ray-cluster-name:*'` against `hai-gcp-models`, grouped by `(ray-cluster-name, ray-user-node-type, state)`.

## Expected impact

- **`marin-us-central1`** is at ~95% v5p TPU utilization right now.
After deploy, the autoscaler will block new scale-ups above the cap; existing workers finish their current tasks and are not forcibly terminated, but cannot be replaced once the cap is reached. Jobs will lose headroom over the next several hours rather than crashing immediately. **Please post in #infra Discord before deploying this one.**
- All other live clusters are warm-idle, draining, or head-only; deploy impact is ~0.
- Dead clusters are self-applying: the YAML change takes effect the next time anyone runs `ray up` against them. Since they have no live head node, there is nothing to push.

## Deploy

After merge, per `infra/README.md`:

```bash
# Live clusters: apply one at a time, verify autoscaler logs after each
uv run ray up -y infra/marin-us-east5-a.yaml
uv run ray up -y infra/marin-eu-west4.yaml
uv run ray up -y infra/marin-eu-west4-a.yaml
uv run ray up -y infra/marin-us-central2.yaml
uv run ray up -y infra/marin-us-central2-staging.yaml
uv run ray up -y infra/marin-us-east1-d-vllm.yaml
uv run ray up -y infra/marin-us-central1.yaml  # post in #infra Discord first

# Autoscaler log tail for verification:
# ray exec infra/marin-<cluster>.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"
```

Dead clusters (`*-vllm`, `us-east1`) need no deploy; there is no live head node to push to.

## Not in this PR (deliberate)

- **`infra/marin-big-run.yaml`**: 17 worker nodes in us-central2-b, uses a non-standard naming convention without the `ray-cluster-name` label. Needs structural review. Follow-up.
- **`infra/marin-us-east5.yaml`**: currently draining (5 READY + 7 DELETING). Let it settle; revisit after.
- **`infra/marin-us-west4.yaml`**: `ray status` showed 2 untagged workers; ambiguous in Cloud Asset Inventory. Needs re-query with a different filter. Follow-up.
- **Iris `buffer_slices` bumps** in `lib/iris/examples/marin.yaml`: not needed to absorb the capacity (Iris `max_slices` are generously sized), but may be worth a reactive bump if migration friction surfaces.

## Test plan

- [ ] CI: YAML parses cleanly (repo-level pre-commit YAML check passed locally).
- [ ] Dry-run deploy on one dead cluster (e.g. `marin-eu-west4-vllm.yaml`) to confirm the new caps are accepted by Ray's config parser. Nothing is running there, so no risk.
- [ ] Deploy live non-busy clusters first (us-east5-a, eu-west4, eu-west4-a, us-central2, us-central2-staging, us-east1-d-vllm). Tail autoscaler logs; confirm no new scale-ups above cap.
- [ ] Post in #infra Discord, then deploy `marin-us-central1`. Monitor for job failures over the next few hours.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
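The observed counts behind the changes table (and any re-query for `us-west4`) come from grouping inventory rows by `(ray-cluster-name, ray-user-node-type, state)`. That grouping step can be sketched as a small pipeline; the `gcloud` invocation quoted from above is left as a comment so the snippet stays self-contained, and sample lines stand in for its output (extracting the three fields from the real JSON response is omitted):

```shell
# Group inventory rows by (ray-cluster-name, ray-user-node-type, state).
# Real input would come from the query cited above, roughly:
#   gcloud asset search-all-resources \
#     --asset-types=tpu.googleapis.com/Node \
#     --query='labels.ray-cluster-name:*'
# against project hai-gcp-models.
group_counts() {
  sort | uniq -c | sort -rn
}

# Sample rows showing the output shape: count, cluster, node type, state.
printf '%s\n' \
  'marin-us-central1 tpu_slice_v5p_8 READY' \
  'marin-us-central1 tpu_slice_v5p_8 READY' \
  'marin-eu-west4 tpu_worker READY' \
| group_counts
```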