We have several clusters for Marin, each with a different TPU type.



## Our Cluster

At a high-level, this directory provides all setup scripts and configuration files for standing up a Ray Cluster and
interacting with it to do lightweight monitoring and reconfiguration. The architecture of our cluster is as follows:

- **Head Node**: A *persistent* (on-demand) [`n2-standard-8` GCP VM](https://cloud.google.com/compute/docs/general-purpose-machines) with 8 CPUs and 32 GB of
RAM, and a 200 GB disk.
- **Worker Nodes**: An autoscaling number of **preemptible** TPU v4-8 or v5e VMs; a minimum of 4 VMs will be kept alive
at all times, with a maximum of 1024 VMs alive at once (we can increase this number).

In the v4 cluster, we use v4-8s as our worker nodes; in the v5e clusters, we use v5e-1s.

The head node is responsible for coordinating the cluster, while the worker nodes are responsible for executing the
actual tasks. In general, we try to avoid running any actual computation on the head node, as it is a shared resource.

## Ray

[Ray](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html) provides the underlying job submission, scheduling, and autoscaling framework for the cluster.
Jobs should still follow these principles for preemptible compute:
- **Checkpointable**: Write to GCS frequently and use small atomic units of work (see the sketch below)
- **Streaming**: Avoid materializing entire datasets in memory
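
As a sketch of the checkpointing principle (the bucket, paths, and the `process_shard` helper here are all hypothetical), a shard-level job can publish each completed unit of work to GCS atomically and skip work that a preempted attempt already finished:

```bash
# Hypothetical sketch: process one shard, then publish the result to GCS.
# GCS object uploads are atomic, so a partially written shard never becomes visible.
SHARD="shard-00042"
OUTPUT="gs://example-marin-bucket/processed/${SHARD}.jsonl"  # hypothetical path

if gsutil -q stat "$OUTPUT"; then
  echo "$SHARD already done, skipping"    # a previous (preempted) attempt finished it
else
  process_shard "$SHARD" > "/tmp/${SHARD}.jsonl"  # placeholder for the real work
  gsutil cp "/tmp/${SHARD}.jsonl" "$OUTPUT"
fi
```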

---

# Maintaining a Ray Cluster

## Setup

Install [gcloud](https://cloud.google.com/sdk/docs/install). On macOS, you can install the CLI with `brew install gcloud-cli`.

You will also need to authenticate with GCP and set the default project.

```bash
gcloud auth login # follow instructions
gcloud auth application-default login # follow instructions
gcloud config set project hai-gcp-models
```
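
To sanity-check the setup, confirm the active account and project:

```bash
gcloud auth list                 # the active account should be your login
gcloud config get-value project  # should print hai-gcp-models
```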


## Cluster Management

Each cluster config is in a separate file in the `infra` directory. These files are automatically generated by the
`scripts/ray/cluster.py` script, which reads the `infra/marin-cluster-template.yaml` file. **Do not edit the
`cluster.yaml` files directly**. Instead, edit the template file and run the script to update the cluster configs.
For short-term testing, it's fine to edit a `cluster.yaml` directly, but remember to update the template file and
regenerate the configs before merging. Check the generated configs in to the repo.
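
For example, after editing the template:

```bash
# Edit infra/marin-cluster-template.yaml, then regenerate every cluster config:
uv run scripts/ray/cluster.py update-configs
git add infra/marin-*.yaml  # check the generated configs in
```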

### Cluster management tool

For most operations, you can use the cluster management tool at `scripts/ray/cluster.py`. You can find its documentation
in [scripts/ray/README.md](scripts/ray/README.md). Some sample commands:

```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {start-cluster,stop-cluster,restart-cluster}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml {add-worker}
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml dashboard
```

### Ray Commands

You can also use the Ray commands to directly manipulate clusters.


```bash
export CLUSTER=us-central2

# Launch the Cluster -- will automatically provision the head node, and start configuring the minimum number of workers
ray up -y infra/marin-$CLUSTER.yaml

# Kill the Cluster (takes a while to gracefully terminate nodes)
ray down -y infra/marin-$CLUSTER.yaml

# Monitor the Ray Autoscaler Logs (in case there are problems spinning up workers)
# =>> Note that `exec` lets you run arbitrary commands on the head node!
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 -f /tmp/ray/session_latest/logs/monitor*"

# SSH into the Head Node
ray attach infra/marin-$CLUSTER.yaml
```

By default, each cluster is provisioned with a persistent `n2-standard-8` VM acting as
the head node. At any given time, there should be a minimum of 4 TPU `v4-8` VMs acting as workers, with an autoscaling
limit of 1024 VMs (so a maximum of 8 * 1024 = 8,192 total v4 cores or 4,096 v4 chips).

Each TPU `v4-8` VM actually has a surprisingly large number of CPUs and amount of RAM (~240 CPUs, 200+ GB of RAM). However, Ray does
nothing under the hood to ensure that a job stays within the logical resources it requests
(e.g., a job that on paper requests 1 CPU can use arbitrarily many cores and any amount of RAM). To mitigate this on the scheduler side,
the config file exposes only 120 visible CPUs on each worker.
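
To see what the scheduler actually thinks is available, you can print Ray's view of the cluster's logical resources from the head node (using `ray exec` as above):

```bash
ray exec infra/marin-$CLUSTER.yaml "ray status"  # logical CPUs/TPUs as Ray sees them, not the hardware
```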

### Restarting the Cluster

#### Restart Policy

When you need to restart a cluster, follow this policy:

1. **Notify**: Post in the #infra Discord channel about your plan to restart the cluster
2. **Check for running jobs**: Check if there are any active jobs on the cluster
3. **Ping affected users** (optional): If there are running jobs and you plan to use `--preserve-jobs=0`, ping the relevant people who own those jobs and give them time to respond (e.g., 15-30 minutes)
4. **Proceed with restart**: After notification and any necessary waiting period, proceed with the restart

>[!NOTE]
>The job restoration logic (enabled by default with `--preserve-jobs=1`) works reliably in most cases. However, being considerate of other users' work is still important.

**When to restart**: Restarts are appropriate when:
- The cluster is in a broken state (e.g., workers not connecting)
- The autoscaler is not functioning properly
- Configuration changes require a fresh start

#### Common Restart Scenario

There is currently a bug on the Ray autoscaler side with spot TPU instances: the autoscaler cannot detect when
spot TPU instances have died, so the cluster can end up with just the head node and
no new spot TPU workers starting up. When this happens, please message in the #infra Discord
that you are going to restart the cluster, and then run `uv run scripts/ray/cluster.py --config <config> restart-cluster`.
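
A rough sequence for diagnosing and clearing this state (the log path is the one used above):

```bash
# Check the autoscaler logs for dead spot instances that Ray still thinks are alive.
ray exec infra/marin-$CLUSTER.yaml "tail -n 100 /tmp/ray/session_latest/logs/monitor*"

# If the cluster is stuck, restart it (after posting in #infra).
uv run scripts/ray/cluster.py --config infra/marin-$CLUSTER.yaml restart-cluster
```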

#### Restart Options

* **Job preservation**: By default, `--preserve-jobs=1` backs up running jobs and resubmits them after restart. For a completely clean slate, use `--preserve-jobs=0`:
```bash
uv run ./scripts/ray/cluster.py --config=infra/marin-us-central2.yaml restart-cluster --preserve-jobs=0
```
* **Reserved workers**: If there are any reserved workers on the cluster, see the instructions below, though in many cases the command above is all you need.

### Adding manual workers

Ray cannot automatically schedule TPUs using our reserved capacity. These must be added to the cluster manually.

```bash
export CLUSTER=us-east5-b-vllm
uv run scripts/ray/cluster.py --config infra/marin-$CLUSTER.yaml add-worker v6e-8 --capacity reserved
```

Remember to:
1. Message in the #marin Discord channel before restarting
2. Wait for the cluster to fully initialize before running jobs
3. Be patient with the first job after restart as it may take ~10 minutes for workers to spin up

### Reconfiguring the Cluster

To reconfigure the cluster, you should generally use the `scripts/ray/cluster.py` script and the template
file `infra/marin-cluster-template.yaml` and not modify the `cluster.yaml` directly. This script will update all the
cluster configs in the `infra` directory with your changes.

In general, for additive operations like increasing the `max_workers` for autoscaling, you can just call `ray up`
against the already-running cluster. For larger changes, like changing the machine type of the workers, you should bring
the cluster down (`ray down`) and then bring it back up (`ray up`).

If you need to change something else about the cluster, e.g. if you're changing any of the initialization/setup
commands, it's best to bring the entire cluster down (`ray down`), *then edit the `marin-cluster-template.yaml`*, and
then bring the cluster back up (`ray up`); note that this will kill all VMs, including the head node.
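
Putting that together, a full reconfiguration looks roughly like this:

```bash
export CLUSTER=us-central2

ray down -y infra/marin-$CLUSTER.yaml  # kills all VMs, including the head node
# ... edit infra/marin-cluster-template.yaml ...
uv run scripts/ray/cluster.py update-configs
ray up -y infra/marin-$CLUSTER.yaml
```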

#### Docker Image

If you need to make substantive changes to the machine software, you should change the Dockerfile at
`docker/marin/Dockerfile.cluster`. Then run `make cluster_docker` to rebuild the Docker image and push it to the
Google Artifact Registry. (Note that by default this updates the Docker images for all clusters; if you only want to update one cluster, you can modify the `CLUSTER_REPOS` variable in the Makefile.) This will create a new image and a new tag, of the form
`"us-central2-docker.pkg.dev/hai-gcp-models/marin/marin_cluster:<TAG>"`. Tags can include the latest commit hash and the
date, for example:

```makefile
CLUSTER_REPOS = us-central2 europe-west4 us-west4
TAG_VERSIONS = latest $(shell git rev-parse --short HEAD) $(shell date -u +"%Y%m%d")
```

The `cluster_docker` command will handle creating artifact repositories if they don't exist, building the Docker image,
tagging it, and pushing it to all relevant regions and versions. If you run into a permissions error (e.g., 403) when pushing the Docker image, you may need to authenticate to the repo:

```bash
gcloud auth configure-docker us-central2-docker.pkg.dev
gcloud auth configure-docker europe-west4-docker.pkg.dev
gcloud auth configure-docker us-west4-docker.pkg.dev
```

After building the Docker image and pushing it to the relevant regions and versions, you need to update the
Ray configuration files to point to the latest version. `make cluster_docker` should update the `LATEST` tag
in `src/main/cluster/config.py` for you, but you can check just in case.
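
A quick way to check:

```bash
grep -n "LATEST" src/main/cluster/config.py  # should show the tag you just pushed
```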

Run `uv run scripts/ray/cluster.py update-configs` to regenerate the cluster configs. This will update each cluster
config in the `infra` directory with the corresponding new Docker image tag.

After that, you can restart each cluster with `ray down` and `ray up`.
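
For example, to roll the new image out to every cluster (region list taken from the Makefile above):

```bash
for CLUSTER in us-central2 europe-west4 us-west4; do
  ray down -y infra/marin-$CLUSTER.yaml
  ray up -y infra/marin-$CLUSTER.yaml
done
```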

**If you use a cluster, please use the corresponding bucket, as data transfer costs between regions are high.**


**Word of Warning**: Ray looks at the actual `cluster_name` and various worker names/configs to "identify" existing/new
clusters. To prevent orphaned states, do not change the names of the clusters without first bringing
the cluster down!


#### Environment Variables

We are currently using Google Secret Manager to store the environment variables that are needed to run the cluster.
You can edit those secrets by going to the Google Cloud Console and navigating to the Secret Manager. Once you add
a new version, you can cause the changes to propagate by killing the workers or restarting the cluster.
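
You can also add a new secret version from the command line (the secret name and env file here are hypothetical):

```bash
# MARIN_ENV is a hypothetical secret name; .env is a local file with the new values.
gcloud secrets versions add MARIN_ENV --data-file=.env
# Then kill the workers or restart the cluster so the change propagates.
```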


### Adding TPU Nodes Manually to the Cluster

Ray only supports on-demand and preemptible TPUs; reserved nodes must be added to the cluster manually.

The unified cluster manager provides this functionality:

```bash
# Add a reserved TPU worker (functionality consolidated from manual_ray_worker_launch.py)
uv run scripts/ray/cluster.py --config infra/marin-us-central2.yaml add-worker v4-128 --capacity reserved
```

**Note**: This functionality is currently being integrated. Contact the team for assistance with adding reserved workers if needed.

## Artifact Registry Cleanup Policy Management

To keep our Docker artifact registries tidy, we provide a script and Makefile target to automatically configure a cleanup policy for all our standard GCP regions. This policy deletes images older than 30 days from the registry,
except we keep the most recent 16 tags. Run it in the following situations (a `gcloud` sketch follows the list):

- After creating new Artifact Registry repositories in new regions.
- Periodically, to ensure all regions have the correct cleanup policy applied.
- After onboarding a new GCP project or changing repository names.
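
Under the hood, the same kind of policy can be applied directly with `gcloud` (a sketch; the policy file name is an assumption, and the repository/region follow the image paths above):

```bash
# Apply a cleanup policy (defined in cleanup-policy.json) to one region's repo.
gcloud artifacts repositories set-cleanup-policies marin \
  --project=hai-gcp-models \
  --location=us-central2 \
  --policy=cleanup-policy.json
```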
