Commit 03e0b6e
infra: delete Ray cluster templates and retire Ray-cluster docs (#5132)
Part of the Ray retirement umbrella (#4453). Implements **stage 6** of `ray_removal_analysis.md`: cluster templates.

## Summary

- Deleted 16 live Ray cluster configs (`infra/marin-{big-run,eu-west4,eu-west4-a,eu-west4-vllm,us-central1,us-central1-vllm,us-central2,us-central2-staging,us-central2-vllm,us-east1,us-east1-d-vllm,us-east5,us-east5-a,us-east5-a-vllm,us-east5-b-vllm,us-west4}.yaml`).
- Deleted the two generator templates (`marin-cluster-template.yaml`, `marin-vllm-template.yaml`).
- Stripped the `Our Cluster` and `Maintaining a Ray Cluster` sections from `infra/README.md`; kept the `Artifact Registry Cleanup Policy Management` section (Iris clusters use the same registry).
- Net: 19 files changed, 4,049 lines removed.

## Ordering / dependencies

This PR assumes the per-cluster `ray down` teardown (stage 7) has already been performed. Per the plan, the cluster retires this week or next — ordering stays code-first, cluster-last.

## Known residual references (out of scope)

`rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` still returns hits in:

- `lib/marin/src/marin/cluster/config.py` — scheduled for stage 5.
- `lib/fray/src/fray/v1/cluster/ray/config.py` — scheduled for stage 3f.
- `.agents/projects/linear_ce_loss.md`, `.agents/projects/vllm-docker.md` — historical logbooks the plan explicitly defers.
- `infra/marin-tmux.sh` (not in the verify path, but now dead) — will go with stage 5 / stage 7.

These were identified during verification and match the staging called out in the plan.

## Test plan

- [x] `./infra/pre-commit.py infra/README.md` passes.
- [x] `rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` returns only the residuals listed above (all scheduled for other stages).
- [x] `infra/README.md` retains the Artifact Registry section and renders cleanly.

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
1 parent 003de39 commit 03e0b6e

20 files changed

Lines changed: 0 additions & 4086 deletions

infra/README.md

Lines changed: 0 additions & 208 deletions
@@ -12,21 +12,6 @@ We have several clusters for Marin, each with a different TPU type:
## Our Cluster

At a high level, this directory provides all setup scripts and configuration files for standing up a Ray Cluster and
interacting with it to do lightweight monitoring and reconfiguration. The architecture of our cluster is as follows:

- **Head Node**: A *persistent* (on-demand) [`n2-standard-8` GCP VM](https://cloud.google.com/compute/docs/general-purpose-machines) with 8 CPUs, 32 GB of RAM, and a 200 GB disk.
- **Worker Nodes**: An autoscaling number of **preemptible** TPU v4-8 or v5e VMs; a minimum of 4 VMs is kept alive at all times, with a maximum of 1024 VMs alive at once (we can raise this limit).

In the v4 cluster, we use v4-8s as our worker nodes. In the v5e clusters, we use v5e-1s as our worker nodes.

The head node is responsible for coordinating the cluster, while the worker nodes are responsible for executing the
actual tasks. In general, we try to avoid running any actual computation on the head node, as it is a shared resource.

## Ray
[Ray](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html) provides the underlying
@@ -89,197 +74,6 @@ Jobs should still follow these principles for preemptible compute:
- **Checkpointable**: Write to GCS frequently, use small atomic units of work
- **Streaming**: Avoid materializing entire datasets in memory

---

# Maintaining a Ray Cluster

## Setup

Install [gcloud](https://cloud.google.com/sdk/docs/install). On macOS, you can install the CLI with `brew install gcloud-cli`.

You will also need to authenticate with GCP and set the default project.

```bash
gcloud auth login                     # follow instructions
gcloud auth application-default login # follow instructions
gcloud config set project hai-gcp-models
```
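As a quick sanity check before running any cluster commands, you can confirm which account and project gcloud will use (a sketch using standard gcloud commands; `hai-gcp-models` is the project set above):

```shell
# Show the currently active gcloud account and the default project.
gcloud auth list --filter=status:ACTIVE --format="value(account)"
gcloud config get-value project   # should print: hai-gcp-models
```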

## Cluster Management

Each cluster config is a separate file in the `infra` directory. These files are automatically generated by the
`scripts/ray/cluster.py` script, which reads the `infra/marin-cluster-template.yaml` file. **Do not edit the
`cluster.yaml` files directly.** Instead, edit the template file and run the script to update the cluster configs.
For short-term testing, it's fine to edit a `cluster.yaml` directly, but remember to update the template file and
regenerate the configs before merging. Check the generated configs in to the repo.

### Cluster management tool

For most operations, you can use the cluster management tool at `scripts/ray/cluster.py`. You can find its documentation
in [scripts/ray/README.md](scripts/ray/README.md). Some sample commands:

```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {start-cluster,stop-cluster,restart-cluster}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml add-worker
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml dashboard
```

### Ray Commands

You can also use Ray commands to manipulate clusters directly.

```bash
export CLUSTER=us-central2

# Launch the cluster -- automatically provisions the head node and starts configuring the minimum number of workers
ray up -y infra/marin-$CLUSTER.yaml

# Kill the cluster (takes a while to gracefully terminate nodes)
ray down -y infra/marin-$CLUSTER.yaml

# Monitor the Ray autoscaler logs (in case there are problems spinning up workers)
# =>> Note that `exec` lets you run arbitrary commands on the head node!
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

# SSH into the head node
ray attach infra/marin-$CLUSTER.yaml
```

By default, each cluster is provisioned with a persistent `n2-standard-8` VM acting as
the head node. At any given time, there should be a minimum of 4 TPU `v4-8` VMs acting as workers, with an autoscaling
limit of 1024 VMs (so a maximum of 8 * 1024 = 8,192 total v4 cores, or 4,096 v4 chips).

Each TPU `v4-8` VM actually has a surprising number of CPUs and amount of RAM (~240 CPUs, 200+ GB of RAM). However, Ray doesn't
do anything under the hood to ensure that a job uses only the logical resources it requests
(e.g., a job that on paper requests 1 CPU can use arbitrary cores/RAM). To mitigate this on the scheduler side,
the config file exposes only 120 visible CPUs on each worker.

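To see what the scheduler actually believes each worker exposes, you can dump the autoscaler status from the head node (a sketch using the standard `ray status` and `ray exec` commands; the config path is one of the clusters above):

```shell
# Print the resource totals the Ray scheduler sees. With the CPU cap in place,
# each v4-8 worker should contribute 120 CPUs rather than its ~240 physical cores.
ray exec infra/marin-us-central2.yaml "ray status"
```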
### Restarting the Cluster

#### Restart Policy

When you need to restart a cluster, follow this policy:

1. **Notify**: Post in the #infra Discord channel about your plan to restart the cluster.
2. **Check for running jobs**: Check whether there are any active jobs on the cluster.
3. **Ping affected users** (optional): If there are running jobs and you plan to use `--preserve-jobs=0`, ping the people who own those jobs and give them time to respond (e.g., 15-30 minutes).
4. **Proceed with restart**: After notification and any necessary waiting period, proceed with the restart.

>[!NOTE]
>The job restoration logic (enabled by default with `--preserve-jobs=1`) works reliably in most cases. However, being considerate of other users' work is still important.

**When to restart**: Restarts are appropriate when:
- The cluster is in a broken state (e.g., workers not connecting)
- The autoscaler is not functioning properly
- Configuration changes require a fresh start

#### Common Restart Scenario

There is currently a bug on the Ray autoscaler side with spot-TPU instances: the autoscaler cannot detect when spot-TPU instances are dead, so the cluster may be left with just the head node and no new spot-TPU workers starting up. When this happens, message in the #infra Discord that you are going to restart the cluster, and then run `uv run scripts/ray/cluster.py --config <config> restart-cluster`.

#### Restart Options

* **Job preservation**: By default, `--preserve-jobs=1` backs up running jobs and resubmits them after the restart. For a completely clean slate, use `--preserve-jobs=0`:
```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml restart-cluster --preserve-jobs=0
```
* **Reserved workers**: If there are any reserved workers on the cluster, see the instructions below, though in many cases the command above is all you need.

### Adding manual workers

Ray cannot automatically schedule TPUs using our reserved capacity. These must be added to the cluster manually.

```bash
export CLUSTER=us-east5-b-vllm
uv run scripts/ray/cluster.py --config infra/marin-$CLUSTER.yaml add-worker v6e-8 --capacity reserved
```

Remember to:
1. Message in the #marin Discord channel before restarting.
2. Wait for the cluster to fully initialize before running jobs.
3. Be patient with the first job after a restart; it may take ~10 minutes for workers to spin up.

### Reconfiguring the Cluster

To reconfigure the cluster, you should generally use the `scripts/ray/cluster.py` script and the template
file `infra/marin-cluster-template.yaml`, and not modify the `cluster.yaml` files directly. The script will update all the
cluster configs in the `infra` directory with your changes.

In general, for additive operations like increasing the `max_workers` for autoscaling, you can just call `ray up`
against the already-running cluster. For larger changes, like changing the machine type of the workers, you should bring
the cluster down (`ray down`) and then bring it back up (`ray up`).

If you need to change something else about the cluster, e.g. if you're changing any of the initialization/setup
commands, it's best to bring the entire cluster down (`ray down`), *then edit the `marin-cluster-template.yaml`*, and
then bring the cluster back up (`ray up`); note that this will kill all VMs, including the head node.

#### Docker Image

If you need to make substantive changes to the machine software, you should change the Dockerfile at
`docker/marin/Dockerfile.cluster`. Then run `make cluster_docker` to rebuild the Docker image and push it to the
Google Artifact Registry. (By default this updates the Docker images for all clusters; to update only one cluster, modify the `CLUSTER_REPOS` variable in the Makefile.) This creates a new image and new tags of the form
`"us-central2-docker.pkg.dev/hai-gcp-models/marin/marin_cluster:<TAG>"`. Tags can include the latest commit hash and the
date, for example:

```makefile
CLUSTER_REPOS = us-central2 europe-west4 us-west4
TAG_VERSIONS = latest $(shell git rev-parse --short HEAD) $(shell date -u +"%Y%m%d")
```
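For illustration, here is how one of those tags expands in shell (a sketch; the image path is the example from above, and only the date-stamped variant is shown):

```shell
# Build a date-stamped image reference of the shape produced by `make cluster_docker`.
REGION="us-central2"
DATE_TAG="$(date -u +"%Y%m%d")"   # e.g. 20240131
IMAGE="${REGION}-docker.pkg.dev/hai-gcp-models/marin/marin_cluster:${DATE_TAG}"
echo "${IMAGE}"
```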

The `cluster_docker` command will handle creating artifact repositories if they don't exist, building the Docker image,
tagging it, and pushing it to all relevant regions and versions. If you run into a permissions error (e.g., 403) when pushing the Docker image, you may need to authenticate to the repo:

```bash
gcloud auth configure-docker us-central2-docker.pkg.dev
gcloud auth configure-docker europe-west4-docker.pkg.dev
gcloud auth configure-docker us-west4-docker.pkg.dev
```

After building the Docker image and pushing it to the relevant regions and versions, you need to update the
Ray configuration files to point to the latest version. `make cluster_docker` should update the `LATEST` tag
in `src/main/cluster/config.py` for you, but check just in case.

Run `uv run scripts/ray/cluster.py update-configs` to regenerate the cluster configs. This will update each cluster
config in the `infra` directory with the corresponding new Docker image tag.

After that, you can restart each cluster with `ray down` and `ray up`.

**If you use a cluster, please use the corresponding bucket, as data transfer costs between regions are high.**

**Word of warning**: Ray looks at the actual `cluster_name` and various worker names/configs to identify existing/new
clusters. To prevent orphaned state, do not change the names of the clusters without first bringing
the cluster down!

#### Environment Variables

We currently use Google Secret Manager to store the environment variables needed to run the cluster.
You can edit those secrets in the Google Cloud Console under Secret Manager. Once you add
a new secret version, you can propagate the change by killing the workers or restarting the cluster.

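Updating a secret does not require the Console; the same can be done from the CLI (a sketch using standard `gcloud secrets` commands; `MARIN_ENV_VARS` and `env_vars.txt` are hypothetical names for illustration):

```shell
# List secrets in the project, then push a new version of one from a local file.
gcloud secrets list --project=hai-gcp-models
gcloud secrets versions add MARIN_ENV_VARS --data-file=./env_vars.txt
# The new version takes effect once workers are killed or the cluster restarts.
```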
### Adding TPU Nodes Manually to the Cluster

Ray only supports on-demand and preemptible TPUs. Reserved nodes must be added to the cluster manually.

The unified cluster manager provides this functionality:

```bash
# Add a reserved TPU worker (functionality consolidated from manual_ray_worker_launch.py)
uv run scripts/ray/cluster.py --config infra/marin-us-central2.yaml add-worker v4-128 --capacity reserved
```

**Note**: This functionality is currently being integrated. Contact the team for assistance with adding reserved workers if needed.

## Artifact Registry Cleanup Policy Management
To keep our Docker artifact registries tidy, we provide a script and Makefile target to automatically configure a cleanup policy for all our standard GCP regions. This policy deletes images older than 30 days from the registry,
@@ -308,5 +102,3 @@ except we keep the most recent 16 tags.
- After creating new Artifact Registry repositories in new regions.
- Periodically, to ensure all regions have the correct cleanup policy applied.
- After onboarding a new GCP project or changing repository names.
