infra: delete Ray cluster templates and retire Ray-cluster docs (#5132)
Part of the Ray retirement umbrella (#4453). Implements **stage 6** of
`ray_removal_analysis.md`: cluster templates.
## Summary
- Deleted 16 live Ray cluster configs
(`infra/marin-{big-run,eu-west4,eu-west4-a,eu-west4-vllm,us-central1,us-central1-vllm,us-central2,us-central2-staging,us-central2-vllm,us-east1,us-east1-d-vllm,us-east5,us-east5-a,us-east5-a-vllm,us-east5-b-vllm,us-west4}.yaml`).
- Deleted the two generator templates (`marin-cluster-template.yaml`,
`marin-vllm-template.yaml`).
- Stripped the `Our Cluster` and `Maintaining a Ray Cluster` sections
from `infra/README.md`; kept the `Artifact Registry Cleanup Policy
Management` section (Iris clusters use the same registry).
- Net: 19 files changed, 4,049 lines removed.
## Ordering / dependencies
This PR does not require the per-cluster `ray down` teardown (stage 7) to have happened first. Per the plan, the cluster retires this week or next; ordering stays code-first, cluster-last.
## Known residual references (out of scope)
`rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` still returns
hits in:
- `lib/marin/src/marin/cluster/config.py` — scheduled for stage 5.
- `lib/fray/src/fray/v1/cluster/ray/config.py` — scheduled for stage 3f.
- `.agents/projects/linear_ce_loss.md`,
`.agents/projects/vllm-docker.md` — historical logbooks the plan
explicitly defers.
- `infra/marin-tmux.sh` (not in the verify path, but now dead) — will go
with stage 5 / stage 7.
These were identified during verification and match the staging called
out in the plan.
## Test plan
- [x] `./infra/pre-commit.py infra/README.md` passes.
- [x] `rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` returns
only the residuals listed above (all scheduled for other stages).
- [x] `infra/README.md` retains Artifact Registry section and renders
cleanly.
---------
Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
### `infra/README.md` (0 additions, 208 deletions)
@@ -12,21 +12,6 @@ We have several clusters for Marin, each with a different TPU type:

## Our Cluster

At a high-level, this directory provides all setup scripts and configuration files for standing up a Ray Cluster and interacting with it to do lightweight monitoring and reconfiguration. The architecture of our cluster is as follows:

- **Head Node**: A *persistent* (on-demand) [`n2-standard-8` GCP VM](https://cloud.google.com/compute/docs/general-purpose-machines) with 8 CPUs and 32 GB of RAM, and a 200 GB disk.
- **Worker Nodes**: An autoscaling number of **preemptible** TPU v4-8 or v5e VMs; a minimum of 4 VMs will be kept alive at all times, with a maximum of 1024 VMs alive at once (we can increase this number).

In the v4 cluster, we use v4-8's as our worker nodes. In the v5e clusters, we use v5e-1's as our worker nodes.

The head node is responsible for coordinating the cluster, while the worker nodes are responsible for executing the actual tasks. In general, we try to avoid running any actual computation on the head node, as it is a shared resource.

## Ray

[Ray](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html) provides the underlying
@@ -89,197 +74,6 @@ Jobs should still follow these principles for preemptible compute:
- **Checkpointable**: Write to GCS frequently, use small atomic units of work
- **Streaming**: Avoid materializing entire datasets in memory

---

# Maintaining a Ray Cluster

## Setup

Install [gcloud](https://cloud.google.com/sdk/docs/install). On MacOS, you can download the CLI with `brew install gcloud-cli`.

You will also need to authenticate with GCP and set the default project.

[...]

Each cluster config is in a separate file in the `infra` directory. These files are automatically generated by the `scripts/ray/cluster.py` script, which reads the `infra/marin-cluster-template.yaml` file. **Do not edit the `cluster.yaml` files directly**. Instead, edit the template file and run the script to update the cluster configs. For short term testing, it's fine to edit the `cluster.yaml` directly, but remember to update the template file and regenerate the configs before merging. Check in the generated configs to the repo.

### Cluster management tool

For most operations, you can use the cluster management tool at `scripts/ray/cluster.py`. You can find the documentation in [scripts/ray/README.md]. Some sample commands:

```
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {start-cluster,stop-cluster,restart-cluster}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {add-worker}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml dashboard
```

### Ray Commands

You can also use the Ray commands to directly manipulate clusters.

```bash
export CLUSTER=us-central2

# Launch the Cluster -- will automatically provision the head node, and start configuring the minimum number of workers
ray up -y infra/marin-$CLUSTER.yaml

# Kill the Cluster (takes a while to gracefully terminate nodes)
ray down -y infra/marin-$CLUSTER.yaml

# Monitor the Ray Autoscaler Logs (in case there are problems spinning up workers)
# =>> Note that `exec` lets you run arbitrary commands on the head node!
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

# SSH into the Head Node
ray attach infra/marin-$CLUSTER.yaml
```

By default, each cluster is provisioned with a persistent `n2-standard-8` VM acting as the head node. At any given time, there should be a minimum of 4 TPU `v4-8` VMs acting as workers, with an autoscaling limit of 1024 VMs (so a maximum of 8 * 1024 = 8,192 total v4 cores or 4,096 v4 chips).

Each TPU `v4-8` VM actually has a surprising amount of CPUs and RAM (~240 CPUs, 200GB+ of RAM). However, Ray doesn't actually do anything under the hood to ensure that a job is actually using the number of logical resources specified (e.g., a job that on paper requests 1 CPU can use arbitrary cores/RAM). To mitigate this on the scheduler side, the config file configures each worker with only 120 visible CPUs.

### Restarting the Cluster

#### Restart Policy

When you need to restart a cluster, follow this policy:

1. **Notify**: Post in the #infra Discord channel about your plan to restart the cluster
2. **Check for running jobs**: Check if there are any active jobs on the cluster
3. **Ping affected users** (optional): If there are running jobs and you plan to use `--preserve-jobs=0`, ping the relevant people who own those jobs and give them time to respond (e.g., 15-30 minutes)
4. **Proceed with restart**: After notification and any necessary waiting period, proceed with the restart

> [!NOTE]
> The job restoration logic (enabled by default with `--preserve-jobs=1`) works reliably in most cases. However, being considerate of other users' work is still important.

**When to restart**: Restarts are appropriate when:
- The cluster is in a broken state (e.g., workers not connecting)
- The autoscaler is not functioning properly
- Configuration changes require a fresh start

#### Common Restart Scenario

There is currently an error on the Ray autoscaler side with spot-TPU instances, where the Ray autoscaler is not able to detect when spot-TPU instances are dead and as a result, we may be left in a state with just the head node and no more spot-TPU worker instances starting up. When this state occurs, please message in the #infra Discord that you are going to restart the cluster, and then run `uv run scripts/ray/cluster.py --config <config> restart-cluster`.

#### Restart Options

* **Job preservation**: By default, `--preserve-jobs=1` backs up running jobs and resubmits them after restart. For a completely clean slate, use `--preserve-jobs=0`:

```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml restart-cluster --preserve-jobs=0
```

* **Reserved workers**: If there are any reserved workers on the cluster, see the instructions below, though in many cases the command above is all you need.

### Adding manual workers

Ray cannot automatically schedule TPUs using our reserved capacity. These must be added to the cluster manually.

```bash
export CLUSTER=us-east5-b-vllm
uv run scripts/ray/cluster.py --config infra/marin-us-east5-b-vllm.yaml add-worker v6e-8 --capacity reserved
```

Remember to:
1. Message in the #marin Discord channel before restarting
2. Wait for the cluster to fully initialize before running jobs
3. Be patient with the first job after restart as it may take ~10 minutes for workers to spin up

### Reconfiguring the Cluster

To reconfigure the cluster, you should generally use the `scripts/ray/cluster.py` script and the template file `infra/marin-cluster-template.yaml` and not modify the `cluster.yaml` directly. This script will update all the cluster configs in the `infra` directory with your changes.

In general, for additive operations like increasing the `max_workers` for autoscaling, you can just call `ray up` against the already-running cluster. For larger changes, like changing the machine type of the workers, you should bring the cluster down (`ray down`) and then bring it back up (`ray up`).

If you need to change something else about the cluster, e.g. if you're changing any of the initialization/setup commands, it's best to bring the entire cluster down (`ray down`), *then edit the `marin-cluster-template.yaml`*, and then bring the cluster back up (`ray up`); note that this will kill all VMs, including the head node.
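As a quick reference, here is a minimal sketch of the template-driven flow described above. It assumes the `update-configs` subcommand mentioned in the Docker Image section below; check `scripts/ray/README.md` for the exact flags.

```bash
# Sketch only: regenerate the per-cluster configs from the template, then apply.
# 1. Edit infra/marin-cluster-template.yaml with your change.

# 2. Regenerate every infra/marin-*.yaml from the template (assumed subcommand, see below).
uv run scripts/ray/cluster.py update-configs

# 3. Apply an additive change (e.g. a higher max_workers) to the running cluster...
ray up -y infra/marin-us-central2.yaml

# ...or, for disruptive changes (machine type, setup commands), recycle the cluster.
ray down -y infra/marin-us-central2.yaml && ray up -y infra/marin-us-central2.yaml
```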
#### Docker Image

If you need to make substantive changes to the machine software, you should change the Docker file at `docker/marin/Dockerfile.cluster`. Then run `make cluster_docker` to rebuild the Docker image and push it to the Google Artifact Registry. (Note that by default this will update the dockers for all clusters; if you only want to update it for one cluster, you can modify the `CLUSTER_REPOS` variable in the Makefile). This will create a new image and a new tag, of the form `"us-central2-docker.pkg.dev/hai-gcp-models/marin/marin_cluster:<TAG>"`. Tags can include the latest commit hash and the

[...]

The `cluster_docker` command will handle creating artifact repositories if they don't exist, building the Docker image, tagging it, and pushing it to all relevant regions and versions. If you run into a permissions error (e.g., 403) when pushing the Docker image, you may need to authenticate to the repo:
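The authentication command itself is not shown here; a typical approach, assuming the standard Artifact Registry credential helper (an assumption, not necessarily what the original README prescribed), is:

```bash
# Assumed example: register gcloud as a Docker credential helper for the
# Artifact Registry host used by the clusters (adjust the region as needed).
gcloud auth configure-docker us-central2-docker.pkg.dev
```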
After building the Docker image and pushing it to the relevant regions and versions, you need to update the Ray configuration files to point to the latest version. `make cluster_docker` should update the `LATEST` tag in `src/main/cluster/config.py` for you, but you can check just in case.

Run `uv run scripts/ray/cluster.py update-configs` to regenerate the cluster configs. This will update each cluster config in the `infra` directory with the corresponding new Docker image tag.

After that, you can restart each cluster with `ray down` and `ray up`.

**If you use a cluster, please use the corresponding bucket, as data transfer costs between regions are high**

If you need to change something else about the cluster, e.g. if you're changing any of the initialization/setup commands, it's best to bring the entire cluster down (`ray down`), *then edit the `cluster.yaml`*, and then bring the cluster back up (`ray up`); note that this will kill all VMs, including the head node (nothing lasts forever).

**Word of Warning**: Ray looks at the actual `cluster_name` and various worker names/configs to "identify" existing/new clusters. To prevent orphaned states, do not change the names of the clusters without first bringing the cluster down!

#### Environment Variables

We are currently using Google Secret Manager to store the environment variables that are needed to run the cluster. You can edit those secrets by going to the Google Cloud Console and navigating to the Secret Manager. Once you add a new version, you can cause the changes to propagate by killing the workers or restarting the cluster.
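Secrets can also be edited from the CLI; a minimal sketch using standard `gcloud` Secret Manager commands (the secret name below is hypothetical, use the one that actually exists in Secret Manager):

```bash
# Sketch: add a new version of a cluster environment-variable secret.
# MARIN_CLUSTER_ENV is a hypothetical secret name; cluster.env holds the new values.
gcloud secrets versions add MARIN_CLUSTER_ENV --data-file=cluster.env

# List versions to confirm the new one is enabled before recycling workers.
gcloud secrets versions list MARIN_CLUSTER_ENV
```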
### Adding TPU Nodes Manually to the Cluster

Ray only supports on demand and preemptible TPUs. For reserved nodes, we need to add them manually to the cluster.

The unified cluster manager provides this functionality:

```bash
# Add a reserved TPU worker (functionality consolidated from manual_ray_worker_launch.py)
uv run scripts/ray/cluster.py --config infra/marin-us-central2.yaml add-worker v4-128 --capacity reserved
```

**Note**: This functionality is currently being integrated. Contact the team for assistance with adding reserved workers if needed.

## Artifact Registry Cleanup Policy Management

To keep our Docker artifact registries tidy, we provide a script and Makefile target to automatically configure a cleanup policy for all our standard GCP regions. This policy deletes images older than 30 days from the registry,

@@ -308,5 +102,3 @@ except we keep the most recent 16 tags.
- After creating new Artifact Registry repositories in new regions.
- Periodically, to ensure all regions have the correct cleanup policy applied.
- After onboarding a new GCP project or changing repository names.
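For reference, the 30-day / keep-16 policy described above can be expressed with standard Artifact Registry cleanup policies. A minimal sketch for a single repository and region (file name, repository, and location are illustrative assumptions; the repo's script and Makefile target remain the source of truth):

```bash
# Sketch only: configure the cleanup policy for one repository/region by hand.
cat > cleanup-policy.json <<'EOF'
[
  {
    "name": "delete-older-than-30-days",
    "action": {"type": "Delete"},
    "condition": {"olderThan": "30d"}
  },
  {
    "name": "keep-most-recent-16",
    "action": {"type": "Keep"},
    "mostRecentVersions": {"keepCount": 16}
  }
]
EOF

# Repository "marin", project "hai-gcp-models", and location "us-central2" are taken
# from the image path above; the Makefile target applies this across all regions.
gcloud artifacts repositories set-cleanup-policies marin \
  --project=hai-gcp-models --location=us-central2 \
  --policy=cleanup-policy.json --no-dry-run
```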