We have several clusters for Marin, each with a different TPU type.



## Our Cluster

At a high-level, this directory provides all setup scripts and configuration files for standing up a Ray Cluster and
interacting with it to do lightweight monitoring and reconfiguration. The architecture of our cluster is as follows:

- **Head Node**: A *persistent* (on-demand) [`n2-standard-8` GCP VM](https://cloud.google.com/compute/docs/general-purpose-machines) with 8 CPUs and 32 GB of
RAM, and a 200 GB disk.
- **Worker Nodes**: An autoscaling number of **preemptible** TPU v4-8 or v5e VMs; a minimum of 4 VMs will be kept alive
at all times, with a maximum of 1024 VMs alive at once (we can increase this number).

In the v4 cluster, we use v4-8s as our worker nodes; in the v5e clusters, we use v5e-1s.

The head node is responsible for coordinating the cluster, while the worker nodes are responsible for executing the
actual tasks. In general, we try to avoid running any actual computation on the head node, as it is a shared resource.

## Ray

[Ray](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html) provides the underlying job submission, scheduling, and autoscaling framework for the cluster.
Jobs should still follow these principles for preemptible compute:
- **Checkpointable**: Write to GCS frequently and use small atomic units of work (see the sketch below)
- **Streaming**: Avoid materializing entire datasets in memory
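
As a sketch of the checkpointing principle (the bucket, paths, and the `process_shard` helper here are all hypothetical), a shard-level job can publish each completed unit of work to GCS atomically and skip work that a preempted attempt already finished:

```bash
# Hypothetical sketch: process one shard, then publish the result to GCS.
# GCS object uploads are atomic, so a partially written shard never becomes visible.
SHARD="shard-00042"
OUTPUT="gs://example-marin-bucket/processed/${SHARD}.jsonl"  # hypothetical path

if gsutil -q stat "$OUTPUT"; then
  echo "$SHARD already done, skipping"    # a previous (preempted) attempt finished it
else
  process_shard "$SHARD" > "/tmp/${SHARD}.jsonl"  # placeholder for the real work
  gsutil cp "/tmp/${SHARD}.jsonl" "$OUTPUT"
fi
```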

---

# Maintaining a Ray Cluster

## Setup

Install [gcloud](https://cloud.google.com/sdk/docs/install). On macOS, you can install the CLI with `brew install gcloud-cli`.

You will also need to authenticate with GCP and set the default project.

```bash
gcloud auth login # follow instructions
gcloud auth application-default login # follow instructions
gcloud config set project hai-gcp-models
```
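
To sanity-check the setup, confirm the active account and project:

```bash
gcloud auth list                 # the active account should be your login
gcloud config get-value project  # should print hai-gcp-models
```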


## Cluster Management

Each cluster config is in a separate file in the `infra` directory. These files are automatically generated by the
`scripts/ray/cluster.py` script, which reads the `infra/marin-cluster-template.yaml` file. **Do not edit the
`cluster.yaml` files directly**. Instead, edit the template file and run the script to update the cluster configs.
For short-term testing, it's fine to edit a `cluster.yaml` directly, but remember to update the template file and
regenerate the configs before merging. Check the generated configs in to the repo.
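
For example, after editing the template:

```bash
# Edit infra/marin-cluster-template.yaml, then regenerate every cluster config:
uv run scripts/ray/cluster.py update-configs
git add infra/marin-*.yaml  # check the generated configs in
```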

### Cluster management tool

For most operations, you can use the cluster management tool at `scripts/ray/cluster.py`. You can find its documentation
in [scripts/ray/README.md](scripts/ray/README.md). Some sample commands:

```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {start-cluster,stop-cluster,restart-cluster}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {add-worker}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml dashboard
```

### Ray Commands

You can also use the Ray commands to directly manipulate clusters.


```bash
export CLUSTER=us-central2

# Launch the Cluster -- will automatically provision the head node, and start configuring the minimum number of workers
ray up -y infra/marin-$CLUSTER.yaml

# Kill the Cluster (takes a while to gracefully terminate nodes)
ray down -y infra/marin-$CLUSTER.yaml

# Monitor the Ray Autoscaler Logs (in case there are problems spinning up workers)
# =>> Note that `exec` lets you run arbitrary commands on the head node!
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

# SSH into the Head Node
ray attach infra/marin-$CLUSTER.yaml
```

By default, each cluster is provisioned with a persistent `n2-standard-8` VM acting as
the head node. At any given time, there should be a minimum of 4 TPU `v4-8` VMs acting as workers, with an autoscaling
limit of 1024 VMs (so a maximum of 8 * 1024 = 8,192 total v4 cores or 4,096 v4 chips).

Each TPU `v4-8` VM actually has a surprisingly large number of CPUs and amount of RAM (~240 CPUs, 200+ GB of RAM). However, Ray does
nothing under the hood to ensure that a job stays within the logical resources it requests
(e.g., a job that on paper requests 1 CPU can use arbitrarily many cores and any amount of RAM). To mitigate this on the scheduler side,
the config file exposes only 120 visible CPUs on each worker.
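
To see what the scheduler actually thinks is available, you can print Ray's view of the cluster's logical resources from the head node (using `ray exec` as above):

```bash
ray exec infra/marin-$CLUSTER.yaml "ray status"  # logical CPUs/TPUs as Ray sees them, not the hardware
```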

### Restarting the Cluster

#### Restart Policy

When you need to restart a cluster, follow this policy:

1. **Notify**: Post in the #infra Discord channel about your plan to restart the cluster
2. **Check for running jobs**: Check if there are any active jobs on the cluster
3. **Ping affected users** (optional): If there are running jobs and you plan to use `--preserve-jobs=0`, ping the relevant people who own those jobs and give them time to respond (e.g., 15-30 minutes)
4. **Proceed with restart**: After notification and any necessary waiting period, proceed with the restart

>[!NOTE]
>The job restoration logic (enabled by default with `--preserve-jobs=1`) works reliably in most cases. However, being considerate of other users' work is still important.

**When to restart**: Restarts are appropriate when:
- The cluster is in a broken state (e.g., workers not connecting)
- The autoscaler is not functioning properly
- Configuration changes require a fresh start

#### Common Restart Scenario

There is currently a bug on the Ray autoscaler side with spot TPU instances: the autoscaler cannot detect when
spot TPU instances have died, so the cluster can end up with just the head node and
no new spot TPU workers starting up. When this happens, please message in the #infra Discord
that you are going to restart the cluster, and then run `uv run scripts/ray/cluster.py --config <config> restart-cluster`.
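
A rough sequence for diagnosing and clearing this state (the log path is the one used above):

```bash
# Check the autoscaler logs for dead spot instances that Ray still thinks are alive.
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 /tmp/ray/session_latest/logs/monitor*"

# If the cluster is stuck, restart it (after posting in #infra).
uv run scripts/ray/cluster.py --config infra/marin-$CLUSTER.yaml restart-cluster
```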

#### Restart Options

* **Job preservation**: By default, `--preserve-jobs=1` backs up running jobs and resubmits them after restart. For a completely clean slate, use `--preserve-jobs=0`:
```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml restart-cluster --preserve-jobs=0
```
* **Reserved workers**: If there are any reserved workers on the cluster, see the instructions below, though in many cases the command above is all you need.

### Adding manual workers

Ray cannot automatically schedule TPUs using our reserved capacity. These must be added to the cluster manually.

```bash
export CLUSTER=us-east5-b-vllm
uv run scripts/ray/cluster.py --config infra/marin-$CLUSTER.yaml add-worker v6e-8 --capacity reserved
```

Remember to:
1. Message in the #marin Discord channel before restarting
2. Wait for the cluster to fully initialize before running jobs
3. Be patient with the first job after restart as it may take ~10 minutes for workers to spin up

### Reconfiguring the Cluster

To reconfigure the cluster, you should generally use the `scripts/ray/cluster.py` script and the template
file `infra/marin-cluster-template.yaml` and not modify the `cluster.yaml` directly. This script will update all the
cluster configs in the `infra` directory with your changes.

In general, for additive operations like increasing the `max_workers` for autoscaling, you can just call `ray up`
against the already-running cluster. For larger changes, like changing the machine type of the workers, you should bring
the cluster down (`ray down`) and then bring it back up (`ray up`).

If you need to change something else about the cluster, e.g. if you're changing any of the initialization/setup
commands, it's best to bring the entire cluster down (`ray down`), *then edit the `marin-cluster-template.yaml`*, and
then bring the cluster back up (`ray up`); note that this will kill all VMs, including the head node.
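
Putting that together, a full reconfiguration looks roughly like this:

```bash
export CLUSTER=us-central2

ray down -y infra/marin-$CLUSTER.yaml  # kills all VMs, including the head node
# ... edit infra/marin-cluster-template.yaml ...
uv run scripts/ray/cluster.py update-configs
ray up -y infra/marin-$CLUSTER.yaml
```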

#### Docker Image

If you need to make substantive changes to the machine software, you should change the Dockerfile at
`docker/marin/Dockerfile.cluster`. Then run `make cluster_docker` to rebuild the Docker image and push it to the
Google Artifact Registry. (Note that by default this updates the Docker images for all clusters; if you only want to update one cluster, you can modify the `CLUSTER_REPOS` variable in the Makefile.) This will create a new image and a new tag, of the form
`"us-central2-docker.pkg.dev/hai-gcp-models/marin/marin_cluster:<TAG>"`. Tags can include the latest commit hash and the
date, for example:

```makefile
CLUSTER_REPOS = us-central2 europe-west4 us-west4
TAG_VERSIONS = latest $(shell git rev-parse --short HEAD) $(shell date -u +"%Y%m%d")
```

The `cluster_docker` command will handle creating artifact repositories if they don't exist, building the Docker image,
tagging it, and pushing it to all relevant regions and versions. If you run into a permissions error (e.g., 403) when pushing the Docker image, you may need to authenticate to the repo:

```bash
gcloud auth configure-docker us-central2-docker.pkg.dev
gcloud auth configure-docker europe-west4-docker.pkg.dev
gcloud auth configure-docker us-west4-docker.pkg.dev
```

After building the Docker image and pushing it to the relevant regions and versions, you need to update the
Ray configuration files to point to the latest version. `make cluster_docker` should update the `LATEST` tag
in `src/main/cluster/config.py` for you, but you can check just in case.
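
A quick way to check:

```bash
grep -n "LATEST" src/main/cluster/config.py  # should show the tag you just pushed
```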

Run `uv run scripts/ray/cluster.py update-configs` to regenerate the cluster configs. This will update each cluster
config in the `infra` directory with the corresponding new Docker image tag.

After that, you can restart each cluster with `ray down` and `ray up`.
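
For example, to roll the new image out to every cluster (region list taken from the Makefile above):

```bash
for CLUSTER in us-central2 europe-west4 us-west4; do
  ray down -y infra/marin-$CLUSTER.yaml
  ray up -y infra/marin-$CLUSTER.yaml
done
```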

**If you use a cluster, please use the corresponding bucket, as data transfer costs between regions are high.**


**Word of Warning**: Ray looks at the actual `cluster_name` and various worker names/configs to "identify" existing/new
clusters. To prevent orphaned states, do not change the names of the clusters without first bringing
the cluster down!


#### Environment Variables

We are currently using Google Secret Manager to store the environment variables that are needed to run the cluster.
You can edit those secrets by going to the Google Cloud Console and navigating to the Secret Manager. Once you add
a new version, you can cause the changes to propagate by killing the workers or restarting the cluster.
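
You can also add a new secret version from the command line (the secret name and env file here are hypothetical):

```bash
# MARIN_ENV is a hypothetical secret name; .env is a local file with the new values.
gcloud secrets versions add MARIN_ENV --data-file=.env
# Then kill the workers or restart the cluster so the change propagates.
```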


### Adding TPU Nodes Manually to the Cluster

Ray only supports on-demand and preemptible TPUs; reserved nodes must be added to the cluster manually.

The unified cluster manager provides this functionality:

```bash
# Add a reserved TPU worker (functionality consolidated from manual_ray_worker_launch.py)
uv run scripts/ray/cluster.py --config infra/marin-us-central2.yaml add-worker v4-128 --capacity reserved
```

**Note**: This functionality is currently being integrated. Contact the team for assistance with adding reserved workers if needed.

## Artifact Registry Cleanup Policy Management

To keep our Docker artifact registries tidy, we provide a script and Makefile target to automatically configure a cleanup policy for all our standard GCP regions. This policy deletes images older than 30 days from the registry,
except we keep the most recent 16 tags. Run it in the following situations (a `gcloud` sketch follows the list):

- After creating new Artifact Registry repositories in new regions.
- Periodically, to ensure all regions have the correct cleanup policy applied.
- After onboarding a new GCP project or changing repository names.
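
Under the hood, the same kind of policy can be applied directly with `gcloud` (a sketch; the policy file name is an assumption, and the repository/region follow the image paths above):

```bash
# Apply a cleanup policy (defined in cleanup-policy.json) to one region's repo.
gcloud artifacts repositories set-cleanup-policies marin \
  --project=hai-gcp-models \
  --location=us-central2 \
  --policy=cleanup-policy.json
```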
