Conversation
Stage 5 of the Ray retirement plan (#4453). Ray operator scripts in scripts/ray/ are superseded by scripts/iris/; the cluster Dockerfile and its CI flow are no longer needed now that Iris handles cluster lifecycle. Deletes - scripts/ray/{cluster,dev_tpu,cleanup_workers}.py and scripts/ray/README.md - scripts/debug/inspect_data.py (Ray-only data-inspection CLI) - docker/marin/Dockerfile.cluster (based on rayproject/ray) - .github/workflows/marin-cleanup-tpus.yaml (invoked scripts/ray/cleanup_workers.py) - Marin cluster image job in .github/workflows/docker-images.yaml - Makefile cluster_docker* targets (built the above Dockerfile) Helpers moved - _normalize_cluster_region and _validate_data_region lifted from the deleted scripts/debug/inspect_data.py into their sole importer, tests/test_inspect_data_region.py. Not deleted this stage - lib/marin/src/marin/cluster/config.py still imported by lib/marin/src/marin/run/ray_run.py, which is queued for stage 3h. - lib/marin/src/marin/cluster/gcp.py still imported by scripts/gcp-ssh and scripts/iris/dev_tpu.py for the non-Ray gcloud helpers (get_project_id, find_tpu_by_ip, find_vm_by_ip, ssh_to_vm, ssh_to_tpu). Both files need a follow-up after stage 3h to split Ray-specific helpers from the generic GCP utilities. Verification - rg 'import ray' lib/ scripts/ returns only lib/fray/** hits. - uv sync (default) succeeds; uv sync --package marin --extra tpu resolves 597 packages (libtpu wheel platform mismatch on macOS is unrelated to this change). - ./infra/pre-commit.py --all-files --fix passes. - Migrated test_inspect_data_region.py: 7 passed in 0.27s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7362d660f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @@ -1,4 +1,4 @@ | |||
| .PHONY: help clean check fix cluster_docker cluster_docker_build cluster_docker_push setup_pre_commit rust-dev rust-user rust-status rust-package | |||
| .PHONY: help clean check fix setup_pre_commit rust-dev rust-user rust-status rust-package | |||
There was a problem hiding this comment.
Update Ray build docs when removing cluster_docker targets
Removing the cluster_docker*/cluster_tag Make targets breaks the documented operator flow immediately, because docs/dev-guide/rebuilding-cluster.md (lines 60-70) and infra/README.md (lines 224-247) still instruct users to run those targets and scripts/ray/cluster.py, which this commit also deletes. In practice this now yields make: *** No rule to make target ... and dead commands in runbooks, so the docs are no longer executable after this change; this also violates the root AGENTS.md rule to keep MkDocs docs in sync with code.
Useful? React with 👍 / 👎.
## Summary Follow-up cleanup to the Ray-removal effort (parent #4453). Flagged by #5089's agent during operator-tooling review. - Delete `lib/marin/src/marin/cluster/config.py` entirely. Its only live caller was `ray_run.py`, removed in #5087. - Trim Ray-only helpers from `lib/marin/src/marin/cluster/gcp.py`: `terminate_tpus_in_cluster`, `terminate_head_node`, `delete_tpu_node`. - Keep the generic gcloud helpers (`run_gcloud_command`, `get_project_id`, `get_default_zone`, `list_instances`, `list_tpu_nodes`, `find_tpu_by_ip`, `find_vm_by_ip`, `ssh_to_vm`, `ssh_to_tpu`) — used by `scripts/iris/dev_tpu.py` (backs the `dev-tpu` skill) and `scripts/gcp-ssh`. - Clean stale `RayClusterConfig` / `update_cluster_configs` exports from `lib/marin/src/marin/cluster/__init__.py`. Total: 535 LOC removed, no new code. ## Audit Performed before deletion on `origin/main` (HEAD `e10e14055`). | Symbol | External callers | Action | |---|---|---| | `marin.cluster.config` module | 0 (grep `marin\.cluster\.config\|from marin\.cluster import config\|cluster\.config\.`) | delete | | `RayClusterConfig` | 0 outside `cluster/__init__.py` and `cluster/config.py` | stale export, remove | | `update_cluster_configs` | 0 outside `cluster/__init__.py` and `cluster/config.py` | stale export, remove | | `terminate_tpus_in_cluster` | 0 outside `gcp.py` itself | delete | | `terminate_head_node` | 0 outside `gcp.py` itself | delete | | `delete_tpu_node` | 1, and it's `terminate_tpus_in_cluster` inside `gcp.py` (line 229) | delete | | `from marin.cluster import gcp` | 2: `scripts/iris/dev_tpu.py`, `scripts/gcp-ssh` | keep module | Post-deletion grep: zero hits for all deleted names in the tree, and the two `marin.cluster.gcp` importers are intact. ### Notes on unused-but-kept helpers These generic helpers have zero external callers right now. They are retained per review instructions — the user will decide whether to prune them: - `get_default_zone`: not called anywhere in marin (levanter has its own copy at `lib/levanter/src/levanter/infra/cli_helpers.py:78`). - `list_instances`: its only internal caller was `terminate_head_node` (now deleted). No external callers. All other "keep" helpers have live callers either externally (scripts) or internally (`run_gcloud_command`, `list_tpu_nodes` used by `find_tpu_by_ip`). ## Test plan - [x] `rg 'marin\.cluster\.config\|from marin\.cluster import config\|cluster\.config\.'` -> 0 hits - [x] `rg 'terminate_tpus_in_cluster\|terminate_head_node\|delete_tpu_node'` -> 0 hits - [x] `rg 'from marin\.cluster import gcp\|marin\.cluster\.gcp'` -> 2 hits, both in `scripts/` - [x] `rg 'RayClusterConfig\|update_cluster_configs'` -> 0 hits - [x] `uv run scripts/iris/dev_tpu.py --help` runs cleanly (lists allocate/connect/execute/release/setup_env/status/watch) - [x] `uv run python -c 'import marin.cluster; import marin.cluster.gcp'` -> imports ok - [x] `./infra/pre-commit.py --all-files --fix` passes (ruff, black, license headers, pyrefly, AST, merges, TOML/YAML, trailing whitespace, EOF newlines, notebooks, markdown) --------- Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
Stage 5 of the Ray retirement plan (#4453). Deletes the Ray operator scripts, the Ray cluster Dockerfile, and the GitHub Actions / Makefile plumbing that built and tested them. Colleague sign-off received.
Summary
Files deleted
scripts/ray/cluster.py(~1040 LOC) — Ray cluster lifecycle wrapper.scripts/ray/dev_tpu.py— superseded byscripts/iris/dev_tpu.py.scripts/ray/cleanup_workers.py— Ray-specific preempted-TPU cleanup.scripts/ray/README.md.scripts/debug/inspect_data.py— Ray-only data-inspection CLI.docker/marin/Dockerfile.cluster— based onrayproject/ray:2.53.0-py311-cpu, audit confirmed Ray-only..github/workflows/marin-cleanup-tpus.yaml— invokedscripts/ray/cleanup_workers.pyon cron.marin-cluster-imagesjob in.github/workflows/docker-images.yaml(build, smoke test, push, and the auto-PR that regenerated Ray cluster configs). Other jobs in the file (Iris, TPU CI, Levanter) are untouched.cluster_docker*/cluster_tagMakefile targets (and their.PHONYentries). These only fedDockerfile.cluster.Helpers moved (not duplicated)
scripts/debug/inspect_data.pydefined two small helpers —_normalize_cluster_regionand_validate_data_region— that were only imported bytests/test_inspect_data_region.py. Rather than find a new home for them, they move into the test file itself (their sole importer). No behavior change.from scripts.debug.inspect_data import _normalize_cluster_region, _validate_data_regiontests/test_inspect_data_region.py.Audit findings —
lib/marin/src/marin/cluster/{config,gcp}.pyNOT deletedThe plan's stage 5 table said these modules ""become dead once
scripts/ray/cluster.pyis gone"" and asked for a re-grep first. The re-grep turned up live callers outside Ray operator tooling, so per the plan's explicit ""stop and report"" directive I am leaving both files in place this stage.Concretely:
lib/marin/src/marin/cluster/config.py— still imported bylib/marin/src/marin/run/ray_run.py, which is the stage 3h target. Once 3h lands this file has no callers and can go.lib/marin/src/marin/cluster/gcp.py— mixed module. Ray-specific functions (terminate_tpus_in_cluster,terminate_head_node, both label-filtered onray-node-*) are dead after this PR, but the generic gcloud wrappers (get_project_id,find_tpu_by_ip,find_vm_by_ip,ssh_to_vm,ssh_to_tpu,list_instances,list_tpu_nodes,delete_tpu_node,run_gcloud_command) have active non-Ray consumers inscripts/gcp-sshandscripts/iris/dev_tpu.py. A follow-up should split the Ray-specific helpers out and keep the rest.I suggest a follow-up after 3h lands: delete
config.pyand splitgcp.pyinto Ray-only (deleted) and generic (kept, possibly relocated out of thecluster/package).Test plan
rg 'import ray' lib/ scripts/— onlylib/fray/**hits remain:uv sync— succeeds. The root package has notpuextra;uv sync --package marin --extra tpuresolves 597 packages on macOS but cannot installlibtpuwheels (no arm64 macOS wheels exist; unrelated to this change)../infra/pre-commit.py --all-files --fix— passes end-to-end: Ruff, Black, license headers, Pyrefly, YAML/TOML, trailing whitespace, EOF newline, Jupyter, large files, Python AST, merge conflicts, markdown.uv run --with pytest pytest tests/test_inspect_data_region.py -v—7 passed in 0.27s.Parent issue: #4453.