Commit c3d7bb8
Cap Ray max_workers, zero min_workers across migrated clusters (#4604)
## Summary
Caps `max_workers` on Ray cluster TPU pools, and drops `min_workers` to
0 — Iris is the warm pool now.
Ray currently serves ~11% of Marin's TPU fleet (39 of 339 READY nodes;
Iris has 271). With Iris handling the bulk of workloads, Ray no longer
needs standing warm capacity, and its `max_workers: 1024` ceilings make
little sense on pools that peaked at single-digit concurrent workers.
Follow-up to #4175, which halved `min_workers` on the same files. This
PR continues the migration push with the following changes:
- **`min_workers: 0`** on every pool that still had a nonzero floor.
- **Single-host TPU pools** (v4-8, v5e-4, v5p-8, v6e-4, v6e-8):
**`max_workers: 4`** — enough to run a small job, not enough to host
serious training.
- **`tpu_slice_v5p_64` in `us-central1`**: `max_workers: 2` (4
observed).
- **`tpu_slice_v6e_128` in `eu-west4-a`**: `max_workers: 1` (1
observed).
- **Shared-accelerator pools** capped together: `us-central1` and
`us-east5-a` each have two v5p-8 pools (`tpu_worker` +
`tpu_slice_v5p_8`); capping one without the other would be a loophole.
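The cap policy in the bullets above could be checked mechanically against a parsed cluster config. The following is a minimal sketch, not part of this PR. It assumes each pool sits under Ray's `available_node_types` with `min_workers` / `max_workers` keys and that the accelerator can be read from `node_config.acceleratorType`. `SINGLE_HOST_ACCELERATORS`, `SINGLE_HOST_CAP`, and `check_pools` are hypothetical names introduced for illustration.

```python
from typing import Any

# Single-host accelerator types named in the summary; "v5litepod-4" is how the
# changes table spells v5e-4. Both constants are assumptions of this sketch.
SINGLE_HOST_ACCELERATORS = {"v4-8", "v5e-4", "v5litepod-4", "v5p-8", "v6e-4", "v6e-8"}
SINGLE_HOST_CAP = 4


def check_pools(config: dict[str, Any]) -> list[str]:
    """Return policy violations for the TPU pools of one parsed cluster YAML."""
    problems: list[str] = []
    for name, pool in config.get("available_node_types", {}).items():
        accel = pool.get("node_config", {}).get("acceleratorType")
        if accel is None:
            continue  # not a TPU pool (for example, the head node)
        # Every TPU pool should have a zero floor.
        if pool.get("min_workers", 0) != 0:
            problems.append(f"{name}: min_workers should be 0")
        # Single-host pools additionally need an explicit, small ceiling.
        max_workers = pool.get("max_workers")
        if accel in SINGLE_HOST_ACCELERATORS and (
            max_workers is None or max_workers > SINGLE_HOST_CAP
        ):
            problems.append(f"{name}: max_workers should be set and <= {SINGLE_HOST_CAP}")
    return problems
```

Because the check walks every pool rather than one pool per accelerator, two pools that share v5p-8 (`tpu_worker` and `tpu_slice_v5p_8`) are both examined, which is the loophole the last bullet describes. Multi-host slices such as `tpu_slice_v5p_64` only get the `min_workers` check here, since their caps were chosen per pool. The config dict would typically come from a YAML loader such as PyYAML's `yaml.safe_load`.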
## Changes (15 edits across 12 files)
| File | Pool | Accel | `min_workers` | `max_workers` |
|---|---|---|---:|---:|
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_64` | v5p-64 | 0 *(unchanged)* | **2** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 6)* | **4** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_worker` | v5p-8 | 0 *(unchanged)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_worker` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-central2.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-us-central2-staging.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4-a.yaml` | `tpu_slice_v6e_128` | v6e-128 | **0** *(was 1)* | **1** *(was 1024)* |
| `infra/marin-eu-west4-vllm.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central1-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central2-vllm.yaml` | `tpu_worker` | v4-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east1-d-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-b-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
Observed counts come from `gcloud asset search-all-resources
--asset-types=tpu.googleapis.com/Node
--query='labels.ray-cluster-name:*'` against `hai-gcp-models`, grouped
by `(ray-cluster-name, ray-user-node-type, state)`.
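The grouping step described above can be sketched as follows. This is an illustrative reconstruction, not the script actually used: it assumes the `gcloud` command was run with `--format=json` and that each result carries a `labels` map and a `state` field. `count_nodes` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_nodes(search_results_json: str) -> Counter:
    """Count TPU nodes per (ray-cluster-name, ray-user-node-type, state)."""
    counts: Counter = Counter()
    for resource in json.loads(search_results_json):
        labels = resource.get("labels", {})
        key = (
            labels.get("ray-cluster-name", "<unlabeled>"),
            labels.get("ray-user-node-type", "<unlabeled>"),
            resource.get("state", "<unknown>"),
        )
        counts[key] += 1
    return counts
```

Passing the captured JSON output to `count_nodes` and reading off the `READY` entries gives one observed-worker count per pool. Nodes without the expected labels land in an `<unlabeled>` bucket rather than being dropped, which makes gaps in the labelling visible.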
## Expected impact
- **`marin-us-central1`** is at ~95% v5p TPU utilization right now.
After deploy, the autoscaler will block new scale-ups above the cap;
existing workers finish their current tasks and are not forcibly
terminated, but cannot be replaced once the cap is reached. Jobs will
lose headroom over the next several hours rather than crashing
immediately. **Please post in #infra Discord before deploying this
one.**
- All other live clusters are warm-idle, draining, or head-only — deploy
impact is ~0.
- On dead clusters the change applies itself: the YAML takes effect the
next time anyone runs `ray up` against them. Since they have no live head
node, there is nothing to push.
## Deploy
After merge, per `infra/README.md`:
```bash
# Live clusters — apply one at a time, verify autoscaler logs after each:
uv run ray up -y infra/marin-us-east5-a.yaml
uv run ray up -y infra/marin-eu-west4.yaml
uv run ray up -y infra/marin-eu-west4-a.yaml
uv run ray up -y infra/marin-us-central2.yaml
uv run ray up -y infra/marin-us-central2-staging.yaml
uv run ray up -y infra/marin-us-east1-d-vllm.yaml
uv run ray up -y infra/marin-us-central1.yaml # post in #infra Discord first
# Autoscaler log tail for verification:
# ray exec infra/marin-<cluster>.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"
```
Dead clusters (`*-vllm` other than `us-east1-d-vllm`, plus `us-east1`)
need no deploy: there is no live head node to push to.
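The one-at-a-time ordering above, with the busy cluster last, could also be scripted. The following is a rough sketch under stated assumptions, not a tool that exists in the repo: `order_for_deploy`, `deploy_commands`, and `deploy` are hypothetical helpers, and the only command issued is the `uv run ray up -y <config>` invocation shown in the block above.

```python
import subprocess

# The busy cluster named in "Expected impact"; it should always go last.
BUSY_CLUSTER = "infra/marin-us-central1.yaml"


def order_for_deploy(configs: list[str], busy: str = BUSY_CLUSTER) -> list[str]:
    """Keep the given order, but move the busy cluster to the end."""
    quiet = [c for c in configs if c != busy]
    return quiet + [c for c in configs if c == busy]


def deploy_commands(configs: list[str]) -> list[list[str]]:
    """Build one `uv run ray up -y <config>` argv per cluster config."""
    return [["uv", "run", "ray", "up", "-y", c] for c in order_for_deploy(configs)]


def deploy(configs: list[str], dry_run: bool = True) -> None:
    """Print each command; run it only when dry_run is False."""
    for argv in deploy_commands(configs):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops at the first failure so logs can be inspected.
            subprocess.run(argv, check=True)
```

With `dry_run=True` (the default) the function only prints the commands, which leaves room to tail the autoscaler logs and post in #infra Discord by hand before the final step.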
## Not in this PR (deliberate)
- **`infra/marin-big-run.yaml`** — 17 worker nodes in us-central2-b,
uses a non-standard naming convention without the `ray-cluster-name`
label. Needs structural review. Follow-up.
- **`infra/marin-us-east5.yaml`** — currently draining (5 READY + 7
DELETING). Let it settle; revisit after.
- **`infra/marin-us-west4.yaml`** — `ray status` showed 2 untagged
workers; ambiguous in Cloud Asset Inventory. Needs re-query with a
different filter. Follow-up.
- **Iris `buffer_slices` bumps** in `lib/iris/examples/marin.yaml` — not
needed to absorb the capacity (Iris `max_slices` are generously sized),
but may be worth a reactive bump if migration friction surfaces.
## Test plan
- [ ] CI: YAML parses cleanly (repo-level pre-commit YAML check passed
locally).
- [ ] Dry-run deploy on one dead cluster (e.g.
`marin-eu-west4-vllm.yaml`) to confirm the new caps are accepted by
Ray's config parser. Nothing is running there, so no risk.
- [ ] Deploy live non-busy clusters first (us-east5-a, eu-west4,
eu-west4-a, us-central2, us-central2-staging, us-east1-d-vllm). Tail
autoscaler logs; confirm no new scale-ups above cap.
- [ ] Post in #infra Discord, then deploy `marin-us-central1`. Monitor
for job failures over the next few hours.
Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
1 parent 40e50a5 commit c3d7bb8
12 files changed
Lines changed: 28 additions & 28 deletions