Commit c3d7bb8

Cap Ray max_workers, zero min_workers across migrated clusters (#4604)
## Summary

Caps `max_workers` on Ray cluster TPU pools and drops `min_workers` to 0, since Iris is the warm pool now.

Ray currently serves ~11% of Marin's TPU fleet (39 of 339 READY nodes; Iris has 271). With Iris handling the bulk of workloads, Ray no longer needs standing warm capacity, and its `max_workers: 1024` ceilings make little sense on pools that peaked at single-digit concurrent workers.

Follow-up to #4175, which halved `min_workers` on the same files. This PR continues the migration push with the following changes:

- **`min_workers: 0`** on every pool that still had a nonzero floor.
- **Single-host TPU pools** (v4-8, v5e-4, v5p-8, v6e-4, v6e-8): **`max_workers: 4`**, enough to run a small job but not enough to host serious training.
- **`tpu_slice_v5p_64` in `us-central1`**: `max_workers: 2` (4 observed).
- **`tpu_slice_v6e_128` in `eu-west4-a`**: `max_workers: 1` (1 observed).
- **Shared-accelerator pools** capped together: `us-central1` and `us-east5-a` each have two v5p-8 pools (`tpu_worker` + `tpu_slice_v5p_8`); capping one without the other would be a loophole.
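The policy above can be spot-checked mechanically. The following is a minimal sketch, not part of this PR: `check_caps` is a hypothetical helper name, and it assumes pool-level `min_workers`/`max_workers` keys are indented under `available_node_types` while any cluster-wide `max_workers` sits unindented at the top level and is skipped. The ceiling is passed in by the caller; 4 is only the largest cap used here, so the tighter caps on `tpu_slice_v5p_64` and `tpu_slice_v6e_128` still need a manual look.

```shell
#!/usr/bin/env bash
# Sketch (assumption: pool-level keys are indented; top-level keys are not).
# check_caps FILE CEILING
#   Prints one line per violation and returns 1 if any pool has
#   max_workers above CEILING or a nonzero min_workers.
check_caps() {
  local file="$1" ceiling="$2"
  awk -v ceiling="$ceiling" '
    BEGIN { bad = 0 }
    # Indented max_workers: compare against the caller-supplied ceiling.
    /^[ \t]+max_workers:[ \t]*[0-9]+/ {
      if ($2 + 0 > ceiling) {
        printf "%s:%d: max_workers %d exceeds %d\n", FILENAME, FNR, $2, ceiling
        bad = 1
      }
    }
    # Indented min_workers: anything other than 0 is a standing warm floor.
    /^[ \t]+min_workers:[ \t]*[0-9]+/ {
      if ($2 + 0 != 0) {
        printf "%s:%d: min_workers %d is nonzero\n", FILENAME, FNR, $2
        bad = 1
      }
    }
    END { exit bad }
  ' "$file"
}
```

One possible use is `check_caps infra/marin-us-central1.yaml 4`. Running it over every `infra/marin-*.yaml` would also flag the files deliberately left out of this PR.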
## Changes (15 edits across 12 files)

| File | Pool | Accel | `min_workers` | `max_workers` |
|---|---|---|---:|---:|
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_64` | v5p-64 | 0 *(unchanged)* | **2** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 6)* | **4** *(was 1024)* |
| `infra/marin-us-central1.yaml` | `tpu_worker` | v5p-8 | 0 *(unchanged)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a.yaml` | `tpu_worker` | v5p-8 | **0** *(was 4)* | **4** *(was 1024)* |
| `infra/marin-us-central2.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-us-central2-staging.yaml` | `tpu_worker` | v4-8 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 2)* | **4** *(was 1024)* |
| `infra/marin-eu-west4-a.yaml` | `tpu_slice_v6e_128` | v6e-128 | **0** *(was 1)* | **1** *(was 1024)* |
| `infra/marin-eu-west4-vllm.yaml` | `tpu_worker` | v5litepod-4 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central1-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-central2-vllm.yaml` | `tpu_worker` | v4-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east1-d-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-a-vllm.yaml` | `tpu_slice_v5p_8` | v5p-8 | **0** *(was 1)* | **4** *(was 1024)* |
| `infra/marin-us-east5-b-vllm.yaml` | `tpu_worker` | v6e-8 | **0** *(was 1)* | **4** *(was 1024)* |

Observed counts come from `gcloud asset search-all-resources --asset-types=tpu.googleapis.com/Node --query='labels.ray-cluster-name:*'` against `hai-gcp-models`, grouped by `(ray-cluster-name, ray-user-node-type, state)`.

## Expected impact

- **`marin-us-central1`** is at ~95% v5p TPU utilization right now.
  After deploy, the autoscaler will block new scale-ups above the cap; existing workers finish their current tasks and are not forcibly terminated, but cannot be replaced once the cap is reached. Jobs will lose headroom over the next several hours rather than crashing immediately. **Please post in #infra Discord before deploying this one.**
- All other live clusters are warm-idle, draining, or head-only; deploy impact is ~0.
- Dead clusters are self-applying: the YAML change takes effect the next time anyone runs `ray up` against them. Since they have no live head node, there is nothing to push.

## Deploy

After merge, per `infra/README.md`:

```bash
# Live clusters: apply one at a time and verify autoscaler logs after each.
uv run ray up -y infra/marin-us-east5-a.yaml
uv run ray up -y infra/marin-eu-west4.yaml
uv run ray up -y infra/marin-eu-west4-a.yaml
uv run ray up -y infra/marin-us-central2.yaml
uv run ray up -y infra/marin-us-central2-staging.yaml
uv run ray up -y infra/marin-us-east1-d-vllm.yaml
uv run ray up -y infra/marin-us-central1.yaml  # post in #infra Discord first

# Autoscaler log tail for verification:
# ray exec infra/marin-<cluster>.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"
```

Dead clusters (`*-vllm`, `us-east1`) need no deploy: there is no live head node to push to.

## Not in this PR (deliberate)

- **`infra/marin-big-run.yaml`**: 17 worker nodes in us-central2-b, uses a non-standard naming convention without the `ray-cluster-name` label. Needs structural review. Follow-up.
- **`infra/marin-us-east5.yaml`**: currently draining (5 READY + 7 DELETING). Let it settle; revisit after.
- **`infra/marin-us-west4.yaml`**: `ray status` showed 2 untagged workers; ambiguous in Cloud Asset Inventory. Needs re-query with a different filter. Follow-up.
- **Iris `buffer_slices` bumps** in `lib/iris/examples/marin.yaml`: not needed to absorb the capacity (Iris `max_slices` are generously sized), but may be worth a reactive bump if migration friction surfaces.

## Test plan

- [ ] CI: YAML parses cleanly (repo-level pre-commit YAML check passed locally).
- [ ] Dry-run deploy on one dead cluster (e.g. `marin-eu-west4-vllm.yaml`) to confirm the new caps are accepted by Ray's config parser. Nothing is running there, so no risk.
- [ ] Deploy live non-busy clusters first (us-east5-a, eu-west4, eu-west4-a, us-central2, us-central2-staging, us-east1-d-vllm). Tail autoscaler logs; confirm no new scale-ups above cap.
- [ ] Post in #infra Discord, then deploy `marin-us-central1`. Monitor for job failures over the next few hours.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
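For the "confirm no new scale-ups above cap" step, the same grouping used for the observed counts can be reproduced after each deploy. This is a rough sketch under stated assumptions: `tally_nodes` is a hypothetical helper, and it expects tab-separated `cluster`, `node type`, `state` rows on stdin. How those rows are extracted from the Cloud Asset Inventory output is not shown in this PR and is left to the caller.

```shell
#!/usr/bin/env bash
# Sketch: count TPU nodes per (ray-cluster-name, ray-user-node-type, state).
# Input (assumed format): one node per line, "cluster<TAB>node_type<TAB>state".
# Output: "count<TAB>cluster<TAB>node_type<TAB>state", sorted by the group key.
tally_nodes() {
  awk -F'\t' '
    NF == 3 { n[$0]++ }                       # skip malformed rows
    END { for (k in n) printf "%d\t%s\n", n[k], k }
  ' | LC_ALL=C sort -t $'\t' -k2
}
```

Comparing the READY count for each pool against its new `max_workers` gives a quick signal on whether the autoscaler is respecting the cap.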
1 parent 40e50a5 commit c3d7bb8

12 files changed

Lines changed: 28 additions & 28 deletions

infra/marin-eu-west4-a.yaml

Lines changed: 2 additions & 2 deletions
@@ -193,8 +193,8 @@ available_node_types:
 TPU: 4

 tpu_slice_v6e_128:
-max_workers: 1024
-min_workers: 1
+max_workers: 1
+min_workers: 0
 node_config:
 acceleratorType: v6e-128
 runtimeVersion: v2-alpha-tpuv6e

infra/marin-eu-west4-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -128,8 +128,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 1
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v5litepod-4
 runtimeVersion: v2-alpha-tpuv5-lite

infra/marin-eu-west4.yaml

Lines changed: 2 additions & 2 deletions
@@ -133,8 +133,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 2
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v5litepod-4
 runtimeVersion: v2-alpha-tpuv5-lite

infra/marin-us-central1-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -140,8 +140,8 @@ available_node_types:
 TPU: 4

 tpu_slice_v5p_8:
-max_workers: 1024
-min_workers: 1
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v5p-8
 runtimeVersion: v2-alpha-tpuv5

infra/marin-us-central1.yaml

Lines changed: 4 additions & 4 deletions
@@ -133,7 +133,7 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
+max_workers: 4
 min_workers: 0
 node_config:
 acceleratorType: v5p-8
@@ -145,8 +145,8 @@ available_node_types:
 TPU: 4

 tpu_slice_v5p_8:
-max_workers: 1024
-min_workers: 6
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v5p-8
 runtimeVersion: v2-alpha-tpuv5
@@ -181,7 +181,7 @@ available_node_types:
 TPU: 4

 tpu_slice_v5p_64:
-max_workers: 1024
+max_workers: 2
 min_workers: 0
 node_config:
 acceleratorType: v5p-64

infra/marin-us-central2-staging.yaml

Lines changed: 2 additions & 2 deletions
@@ -133,8 +133,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 2
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v4-8
 runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-central2-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -128,8 +128,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 1
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v4-8
 runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-central2.yaml

Lines changed: 2 additions & 2 deletions
@@ -133,8 +133,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 2
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v4-8
 runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-east1-d-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -128,8 +128,8 @@ available_node_types:
 # Set Source Image =>> Ubuntu 22.04 Base VM
 sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
 tpu_worker:
-max_workers: 1024
-min_workers: 1
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v6e-8
 runtimeVersion: v2-alpha-tpuv6e

infra/marin-us-east5-a-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -140,8 +140,8 @@ available_node_types:
 TPU: 4

 tpu_slice_v5p_8:
-max_workers: 1024
-min_workers: 1
+max_workers: 4
+min_workers: 0
 node_config:
 acceleratorType: v5p-8
 runtimeVersion: v2-alpha-tpuv5
