Commit e31b270

Halve Ray cluster min_workers, boost Iris min_slices and controller VM
Shift capacity from Ray to Iris to accelerate migration: Ray min_workers halved across all clusters (training + vllm + staging).

Freed capacity given to Iris as min_slices:
- tpu_v5p_8: 0 → 8 (absorbed from us-central1 + us-east5-a)
- tpu_v4_8: 0 → 2 (absorbed from us-central2)
- tpu_v5e_4: 1 → 3 (absorbed from eu-west4)
- tpu_v5e_128: 0 → 1 (absorbed from eu-west4)
- tpu_v6e_128: 0 → 1 (absorbed from eu-west4-a)

Iris controller VM upgraded from e2-standard-4 (16GB) to e2-highmem-4 (32GB) in both prod and dev configs to address memory pressure.
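As a rough cross-check of the arithmetic in the message, the sketch below tabulates the Ray `min_workers` changes that appear in the hunks of this commit and tallies the freed slices per accelerator type. The names `CHANGES`, `halve`, and `freed_by_accelerator` are hypothetical helpers, not part of the repository, and the integer floor division is an assumption inferred from the 1 → 0 changes in the diff. The tally counts every node type shown here, so it need not equal the Iris min_slices figures, which the message attributes to specific clusters.

```python
from collections import defaultdict

# (cluster file, node type, acceleratorType, old min_workers, new min_workers),
# transcribed from the hunks shown in this commit.
CHANGES = [
    ("marin-eu-west4-a", "tpu_slice_v6e_128", "v6e-128", 2, 1),
    ("marin-eu-west4-vllm", "tpu_worker", "v5litepod-4", 2, 1),
    ("marin-eu-west4", "tpu_worker", "v5litepod-4", 4, 2),
    ("marin-eu-west4", "tpu_slice_v5e_128", "v5litepod-128", 1, 0),
    ("marin-us-central1-vllm", "tpu_worker", "v5p-8", 1, 0),
    ("marin-us-central1-vllm", "tpu_slice_v5p_8", "v5p-8", 2, 1),
    ("marin-us-central1", "tpu_worker", "v5p-8", 1, 0),
    ("marin-us-central1", "tpu_slice_v5p_8", "v5p-8", 12, 6),
    ("marin-us-central1", "tpu_slice_v5p_16", "v5p-16", 1, 0),
    ("marin-us-central1", "tpu_slice_v5p_32", "v5p-32", 1, 0),
    ("marin-us-central1", "tpu_slice_v5p_64", "v5p-64", 1, 0),
    ("marin-us-central2-staging", "tpu_worker", "v4-8", 4, 2),
    ("marin-us-central2-vllm", "tpu_worker", "v4-8", 2, 1),
    ("marin-us-central2", "tpu_worker", "v4-8", 4, 2),
    ("marin-us-east1-d-vllm", "tpu_worker", "v6e-8", 2, 1),
    ("marin-us-east5-a-vllm", "tpu_worker", "v5p-8", 1, 0),
    ("marin-us-east5-a-vllm", "tpu_slice_v5p_8", "v5p-8", 2, 1),
]


def halve(min_workers: int) -> int:
    """Halve with floor division (assumed rule: 1 becomes 0)."""
    return min_workers // 2


def freed_by_accelerator(changes):
    """Sum (old - new) min_workers per acceleratorType."""
    freed = defaultdict(int)
    for _cluster, _node_type, accelerator, old, new in changes:
        freed[accelerator] += old - new
    return dict(freed)


if __name__ == "__main__":
    for cluster, node_type, _acc, old, new in CHANGES:
        status = "ok" if halve(old) == new else "MISMATCH"
        print(f"{cluster}:{node_type}: {old} -> {new} ({status})")
    for accelerator, count in sorted(freed_by_accelerator(CHANGES).items()):
        print(f"{accelerator}: {count} slice(s) freed")
```

Every change listed in the table is consistent with floor halving, which is why node types that held a single warm worker now scale to zero when idle.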
1 parent 1a6bc2e commit e31b270

14 files changed

Lines changed: 27 additions & 27 deletions

infra/marin-eu-west4-a.yaml

Lines changed: 1 addition & 1 deletion
@@ -194,7 +194,7 @@ available_node_types:
 
   tpu_slice_v6e_128:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v6e-128
       runtimeVersion: v2-alpha-tpuv6e

infra/marin-eu-west4-vllm.yaml

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v5litepod-4
       runtimeVersion: v2-alpha-tpuv5-lite

infra/marin-eu-west4.yaml

Lines changed: 2 additions & 2 deletions
@@ -134,7 +134,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 4
+    min_workers: 2
     node_config:
       acceleratorType: v5litepod-4
       runtimeVersion: v2-alpha-tpuv5-lite
@@ -194,7 +194,7 @@ available_node_types:
 
   tpu_slice_v5e_128:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5litepod-128
       runtimeVersion: v2-alpha-tpuv5-lite

infra/marin-us-central1-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -129,7 +129,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5
@@ -141,7 +141,7 @@ available_node_types:
 
   tpu_slice_v5p_8:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5

infra/marin-us-central1.yaml

Lines changed: 5 additions & 5 deletions
@@ -134,7 +134,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5
@@ -146,7 +146,7 @@ available_node_types:
 
   tpu_slice_v5p_8:
     max_workers: 1024
-    min_workers: 12
+    min_workers: 6
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5
@@ -158,7 +158,7 @@ available_node_types:
 
   tpu_slice_v5p_16:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-16
       runtimeVersion: v2-alpha-tpuv5
@@ -170,7 +170,7 @@ available_node_types:
 
   tpu_slice_v5p_32:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-32
       runtimeVersion: v2-alpha-tpuv5
@@ -182,7 +182,7 @@ available_node_types:
 
   tpu_slice_v5p_64:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-64
       runtimeVersion: v2-alpha-tpuv5

infra/marin-us-central2-staging.yaml

Lines changed: 1 addition & 1 deletion
@@ -134,7 +134,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 4
+    min_workers: 2
     node_config:
       acceleratorType: v4-8
       runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-central2-vllm.yaml

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v4-8
       runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-central2.yaml

Lines changed: 1 addition & 1 deletion
@@ -134,7 +134,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 4
+    min_workers: 2
     node_config:
       acceleratorType: v4-8
       runtimeVersion: tpu-ubuntu2204-base

infra/marin-us-east1-d-vllm.yaml

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v6e-8
       runtimeVersion: v2-alpha-tpuv6e

infra/marin-us-east5-a-vllm.yaml

Lines changed: 2 additions & 2 deletions
@@ -129,7 +129,7 @@ available_node_types:
           sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
   tpu_worker:
     max_workers: 1024
-    min_workers: 1
+    min_workers: 0
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5
@@ -141,7 +141,7 @@ available_node_types:
 
   tpu_slice_v5p_8:
     max_workers: 1024
-    min_workers: 2
+    min_workers: 1
     node_config:
       acceleratorType: v5p-8
       runtimeVersion: v2-alpha-tpuv5
