Halve Ray cluster minimums, boost Iris capacity #4175

Merged
rjpower merged 4 commits into main from work/fnNrjysU
Mar 27, 2026

Conversation

rjpower (Collaborator) commented Mar 26, 2026

  • Halved min_workers across all Ray clusters (12 files: training, vllm, staging) to free TPU capacity
  • Transferred freed capacity to Iris min_slices in marin.yaml: v5p-8 +8 (0 → 8), v4-8 +2 (0 → 2), v5e-4 +2 (1 → 3), v5e-128 +1 (0 → 1), v6e-128 +1 (0 → 1)
  • Upgraded Iris controller VM from e2-standard-4 (16GB) to e2-highmem-4 (32GB) in prod and dev
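The halving itself is mechanical. A minimal sketch of the transformation, assuming the standard Ray cluster-config layout (`available_node_types.<name>.min_workers`) and using a plain dict in place of the parsed YAML:

```python
# Halve min_workers for every node type in a Ray cluster config.
# The dict mirrors the `available_node_types` section of a parsed
# Ray YAML; the real infra/ files would be loaded with a YAML parser.

def halve_min_workers(config: dict) -> dict:
    """Return a copy of `config` with every min_workers halved (floor division)."""
    out = {name: dict(fields) for name, fields in config["available_node_types"].items()}
    for node_type in out.values():
        if "min_workers" in node_type:
            node_type["min_workers"] //= 2
    return {"available_node_types": out}

# Example values taken from the us-central1 rows in the table below.
cluster = {
    "available_node_types": {
        "tpu_slice_v5p_8": {"min_workers": 12, "max_workers": 1024},
        "tpu_worker": {"min_workers": 1, "max_workers": 1024},
    }
}
halved = halve_min_workers(cluster)
print(halved["available_node_types"]["tpu_slice_v5p_8"]["min_workers"])  # 6
print(halved["available_node_types"]["tpu_worker"]["min_workers"])       # 0
```

Note that floor division sends the `min_workers: 1` pools to 0, which matches the odd-valued rows in the table (1 → 0).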

Ray min_workers changes

| Cluster | Node type | Before | After |
|---|---|---|---|
| us-central1 | tpu_worker (v5p-8) | 1 | 0 |
| us-central1 | tpu_slice_v5p_8 | 12 | 6 |
| us-central1 | tpu_slice_v5p_16 | 1 | 0 |
| us-central1 | tpu_slice_v5p_32 | 1 | 0 |
| us-central1 | tpu_slice_v5p_64 | 1 | 0 |
| us-central2 | tpu_worker (v4-8) | 4 | 2 |
| us-central2-staging | tpu_worker (v4-8) | 4 | 2 |
| eu-west4 | tpu_worker (v5e-4) | 4 | 2 |
| eu-west4 | tpu_slice_v5e_128 | 1 | 0 |
| eu-west4-a | tpu_slice_v6e_128 | 2 | 1 |
| us-east5-a | tpu_worker (v5p-8) | 8 | 4 |
| us-east5-a | tpu_slice_v5p_8 | 8 | 4 |
| us-east5-a-vllm | tpu_worker | 1 | 0 |
| us-east5-a-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-east5-b-vllm | tpu_worker | 2 | 1 |
| eu-west4-vllm | tpu_worker | 2 | 1 |
| us-central1-vllm | tpu_worker | 1 | 0 |
| us-central1-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-central2-vllm | tpu_worker | 2 | 1 |
| us-east1-d-vllm | tpu_worker | 2 | 1 |
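Each row above corresponds to a one-line edit under Ray's `available_node_types` section in the matching `infra/marin-*.yaml` file. A hedged sketch of the us-central1 shape (field layout per Ray's cluster-config schema; all unrelated keys elided):

```yaml
# infra/marin-us-central1.yaml (sketch; unrelated keys omitted)
available_node_types:
  tpu_worker:          # v5p-8
    min_workers: 0     # was 1
  tpu_slice_v5p_8:
    min_workers: 6     # was 12
  tpu_slice_v5p_16:
    min_workers: 0     # was 1
```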

rjpower added 2 commits March 25, 2026 11:02
Analyze controller startup flow, scheduling loop, worker sync,
autoscaler, and checkpoint systems. Identify all side-effectful
operations that need gating for a --dry-run flag.
Shift capacity from Ray to Iris to accelerate migration:

Ray min_workers halved across all clusters (training + vllm + staging).
Freed capacity given to Iris as min_slices:
- tpu_v5p_8: 0 → 8 (absorbed from us-central1 + us-east5-a)
- tpu_v4_8: 0 → 2 (absorbed from us-central2)
- tpu_v5e_4: 1 → 3 (absorbed from eu-west4)
- tpu_v5e_128: 0 → 1 (absorbed from eu-west4)
- tpu_v6e_128: 0 → 1 (absorbed from eu-west4-a)

Iris controller VM upgraded from e2-standard-4 (16GB) to e2-highmem-4 (32GB)
in both prod and dev configs to address memory pressure.
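The receiving side of the transfer, sketched as a fragment of `lib/iris/examples/marin.yaml`. The `min_slices` field and values come from the diff; the exact nesting and the controller key names are assumptions for illustration:

```yaml
# lib/iris/examples/marin.yaml (sketch; exact nesting may differ)
slices:
  tpu_v5p_8:
    min_slices: 8    # was 0; absorbed from us-central1 + us-east5-a
  tpu_v4_8:
    min_slices: 2    # was 0; absorbed from us-central2
  tpu_v5e_4:
    min_slices: 3    # was 1; absorbed from eu-west4
controller:
  machine_type: e2-highmem-4   # was e2-standard-4 (16GB → 32GB)
```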
@rjpower rjpower added the agent-generated Created by automation/agent label Mar 26, 2026
claude bot (Contributor) commented Mar 26, 2026

Claude finished @rjpower's task in 1m 28s.


Review

  • Gather context and read changed files
  • Verify Ray min_workers changes match PR description
  • Review Iris config changes
  • Check for issues
  • Post review

Halves Ray min_workers across 12 cluster configs and transfers freed TPU capacity to Iris min_slices. Also bumps Iris controller VM to e2-highmem-4. All changes are mechanical config adjustments; the diff matches the PR description table.

Findings:

  • Stray research document in repo root: iris-controller-dry-run-analysis.md (177 lines) was added in commit 1a6bc2e. This is an unrelated research artifact and should not be checked into the repo root. Remove it from this PR or move it to an appropriate location (e.g., an issue or .agents/projects/). Left an inline comment on this.

  • Iris v5e-4 min_slices description: The PR summary says "v5e-4 +2", which reads as an increase of 2, but the actual change is min_slices: 1 → min_slices: 3. The delta is correct (+2); noting this for clarity since the other entries start from 0.

  • All Ray min_workers changes verified against the PR description table -- every entry matches.

  • Controller VM upgrade (e2-standard-4 → e2-highmem-4) applied consistently to both marin.yaml and marin-dev.yaml.

No bugs or correctness issues in the config changes themselves.

Comment thread: iris-controller-dry-run-analysis.md (outdated)
@@ -0,0 +1,177 @@
# Iris Controller Dry-Run Mode: Codebase Analysis
claude bot (Contributor) commented:

This research document appears to be from a separate task and shouldn't be committed to the repo root. Consider removing it from this PR (it was added in commit 1a6bc2e).

@rjpower rjpower requested a review from dlwh March 26, 2026 16:50
@rjpower rjpower enabled auto-merge (squash) March 26, 2026 16:51
@rjpower rjpower disabled auto-merge March 27, 2026 21:07
@rjpower rjpower merged commit 478516b into main Mar 27, 2026
41 checks passed
@rjpower rjpower deleted the work/fnNrjysU branch March 27, 2026 21:07
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
- Halved `min_workers` across all Ray clusters (12 files: training,
vllm, staging) to free TPU capacity
- Transferred freed capacity to Iris `min_slices` in `marin.yaml`: v5p-8
+8, v4-8 +2, v5e-4 +2, v5e-128 +1, v6e-128 +1
- Upgraded Iris controller VM from `e2-standard-4` (16GB) to
`e2-highmem-4` (32GB) in prod and dev

## Ray min_workers changes

| Cluster | Node type | Before | After |
|---|---|---|---|
| us-central1 | tpu_worker (v5p-8) | 1 | 0 |
| us-central1 | tpu_slice_v5p_8 | 12 | 6 |
| us-central1 | tpu_slice_v5p_16 | 1 | 0 |
| us-central1 | tpu_slice_v5p_32 | 1 | 0 |
| us-central1 | tpu_slice_v5p_64 | 1 | 0 |
| us-central2 | tpu_worker (v4-8) | 4 | 2 |
| us-central2-staging | tpu_worker (v4-8) | 4 | 2 |
| eu-west4 | tpu_worker (v5e-4) | 4 | 2 |
| eu-west4 | tpu_slice_v5e_128 | 1 | 0 |
| eu-west4-a | tpu_slice_v6e_128 | 2 | 1 |
| us-east5-a | tpu_worker (v5p-8) | 8 | 4 |
| us-east5-a | tpu_slice_v5p_8 | 8 | 4 |
| us-east5-a-vllm | tpu_worker | 1 | 0 |
| us-east5-a-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-east5-b-vllm | tpu_worker | 2 | 1 |
| eu-west4-vllm | tpu_worker | 2 | 1 |
| us-central1-vllm | tpu_worker | 1 | 0 |
| us-central1-vllm | tpu_slice_v5p_8 | 2 | 1 |
| us-central2-vllm | tpu_worker | 2 | 1 |
| us-east1-d-vllm | tpu_worker | 2 | 1 |
yonromai added a commit that referenced this pull request Apr 10, 2026
Ray now serves ~11% of Marin's TPU fleet (39 of 339 ready nodes; Iris
has 271). This caps max_workers on Ray cluster configs and drops
min_workers to 0 - Iris is the warm pool now.

Caps:
- Single-host TPU pools (v4-8, v5e-4, v5p-8, v6e-4, v6e-8): max=4
- v5p-64 (us-central1):  max=2
- v6e-128 (eu-west4-a):  max=1

Shared-accelerator pools are capped together (us-central1 tpu_worker +
tpu_slice_v5p_8 are both v5p-8).

Follow-up to #4175.

Follow-ups not in this PR:
- infra/marin-big-run.yaml (non-standard naming, needs structural review)
- infra/marin-us-east5.yaml (currently draining, let it settle)
- infra/marin-us-west4.yaml (untagged workers, needs re-query)
yonromai added a commit that referenced this pull request Apr 10, 2026
## Summary

Caps `max_workers` on Ray cluster TPU pools, and drops `min_workers` to
0 — Iris is the warm pool now.

Ray currently serves ~11% of Marin's TPU fleet (39 of 339 READY nodes;
Iris has 271). With Iris handling the bulk of workloads, Ray no longer
needs standing warm capacity, and its `max_workers: 1024` ceilings make
little sense on pools that peaked at single-digit concurrent workers.

Follow-up to #4175, which halved `min_workers` on the same files. This
PR continues the migration push by:

- **`min_workers: 0`** on every pool that still had a nonzero floor.
- **Single-host TPU pools** (v4-8, v5e-4, v5p-8, v6e-4, v6e-8):
**`max_workers: 4`** — enough to run a small job, not enough to host
serious training.
- **`tpu_slice_v5p_64` in `us-central1`**: `max_workers: 2` (4
observed).
- **`tpu_slice_v6e_128` in `eu-west4-a`**: `max_workers: 1` (1
observed).
- **Shared-accelerator pools** capped together: `us-central1` and
`us-east5-a` each have two v5p-8 pools (`tpu_worker` +
`tpu_slice_v5p_8`); capping one without the other would be a loophole.
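The shared-pool point above can be sketched concretely: both v5p-8 pools in a cluster get the same cap, since the autoscaler treats them as separate node types. Field layout follows Ray's cluster-config schema; values are from the table below:

```yaml
# infra/marin-us-central1.yaml (sketch): both pools provision v5p-8,
# so capping one without the other would leave a loophole.
available_node_types:
  tpu_worker:            # v5p-8
    min_workers: 0
    max_workers: 4       # was 1024
  tpu_slice_v5p_8:       # also v5p-8
    min_workers: 0       # was 6
    max_workers: 4       # was 1024
```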

## Changes (15 edits across 12 files)

| File | Pool | Accel | `min_workers` | `max_workers` |
|---|---|---|---:|---:|
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_64` | v5p-64 | 0 *(unchanged)* | **2** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 6)* | **4** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_worker` | v5p-8 | 0 *(unchanged)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_worker` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-central2.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-us-central2-staging.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4-a.yaml` | `tpu_slice_v6e_128` | v6e-128 | **0** *(was 1)* | **1** *(was 1024)* |
| `infra/marin-eu-west4-vllm.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central1-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central2-vllm.yaml` | `tpu_worker` | v4-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east1-d-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-b-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |

Observed counts come from `gcloud asset search-all-resources
--asset-types=tpu.googleapis.com/Node
--query='labels.ray-cluster-name:*'` against `hai-gcp-models`, grouped
by `(ray-cluster-name, ray-user-node-type, state)`.
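The grouping step can be sketched in Python over the JSON that the gcloud query emits with `--format=json`. The record shape here (a `labels` map plus a `state` field) is an assumption about the Cloud Asset Inventory output, and `records` stands in for the parsed query result:

```python
# Group TPU node records by (ray-cluster-name, ray-user-node-type, state).
# `records` stands in for the parsed JSON from the gcloud query above;
# the field names are assumptions about the asset-inventory schema.
from collections import Counter

records = [
    {"labels": {"ray-cluster-name": "marin-us-central1",
                "ray-user-node-type": "tpu_slice_v5p_8"}, "state": "READY"},
    {"labels": {"ray-cluster-name": "marin-us-central1",
                "ray-user-node-type": "tpu_slice_v5p_8"}, "state": "READY"},
    {"labels": {"ray-cluster-name": "marin-eu-west4-a",
                "ray-user-node-type": "tpu_slice_v6e_128"}, "state": "READY"},
]

counts = Counter(
    (r["labels"].get("ray-cluster-name", "?"),
     r["labels"].get("ray-user-node-type", "?"),
     r.get("state", "?"))
    for r in records
)
for (cluster, node_type, state), n in sorted(counts.items()):
    print(f"{cluster:24} {node_type:20} {state:8} {n}")
```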

## Expected impact

- **`marin-us-central1`** is at ~95% v5p TPU utilization right now.
After deploy, the autoscaler will block new scale-ups above the cap;
existing workers finish their current tasks and are not forcibly
terminated, but cannot be replaced once the cap is reached. Jobs will
lose headroom over the next several hours rather than crashing
immediately. **Please post in #infra Discord before deploying this
one.**
- All other live clusters are warm-idle, draining, or head-only — deploy
impact is ~0.
- Dead clusters are self-applying: the YAML change takes effect the next
time anyone runs `ray up` against them. Since they have no live head
node, there is nothing to push.

## Deploy

After merge, per `infra/README.md`:

```bash
# Live clusters — apply one at a time, verify autoscaler logs after each:
uv run ray up -y infra/marin-us-east5-a.yaml
uv run ray up -y infra/marin-eu-west4.yaml
uv run ray up -y infra/marin-eu-west4-a.yaml
uv run ray up -y infra/marin-us-central2.yaml
uv run ray up -y infra/marin-us-central2-staging.yaml
uv run ray up -y infra/marin-us-east1-d-vllm.yaml
uv run ray up -y infra/marin-us-central1.yaml   # post in #infra Discord first

# Autoscaler log tail for verification:
# ray exec infra/marin-<cluster>.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"
```

Dead clusters (`*-vllm`, `us-east1`) need no deploy — no live head node
to push to.

## Not in this PR (deliberate)

- **`infra/marin-big-run.yaml`** — 17 worker nodes in us-central2-b,
uses a non-standard naming convention without the `ray-cluster-name`
label. Needs structural review. Follow-up.
- **`infra/marin-us-east5.yaml`** — currently draining (5 READY + 7
DELETING). Let it settle; revisit after.
- **`infra/marin-us-west4.yaml`** — `ray status` showed 2 untagged
workers; ambiguous in Cloud Asset Inventory. Needs re-query with a
different filter. Follow-up.
- **Iris `buffer_slices` bumps** in `lib/iris/examples/marin.yaml` — not
needed to absorb the capacity (Iris `max_slices` are generously sized),
but may be worth a reactive bump if migration friction surfaces.

## Test plan

- [ ] CI: YAML parses cleanly (repo-level pre-commit YAML check passed
locally).
- [ ] Dry-run deploy on one dead cluster (e.g.
`marin-eu-west4-vllm.yaml`) to confirm the new caps are accepted by
Ray's config parser. Nothing is running there, so no risk.
- [ ] Deploy live non-busy clusters first (us-east5-a, eu-west4,
eu-west4-a, us-central2, us-central2-staging, us-east1-d-vllm). Tail
autoscaler logs; confirm no new scale-ups above cap.
- [ ] Post in #infra Discord, then deploy `marin-us-central1`. Monitor
for job failures over the next few hours.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
