Remove dead Ray helpers from marin.cluster #5131
Merged
Follow-up to #5087 (ray_run.py removal) and #5089 (operator tooling). Both landed, leaving residue in lib/marin/src/marin/cluster/:

- config.py: only live caller was ray_run.py; delete entirely.
- gcp.py: trim Ray-only helpers (terminate_tpus_in_cluster, terminate_head_node, delete_tpu_node). Keep the generic gcloud helpers used by scripts/iris/dev_tpu.py and scripts/gcp-ssh.

Also clean stale RayClusterConfig/update_cluster_configs exports from lib/marin/src/marin/cluster/__init__.py.

Refs #4453.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both had zero live callers after the Ray helper removal: get_default_zone was never referenced in-repo (levanter has its own copy at cli_helpers.py:78), and list_instances was used only by the already-deleted terminate_head_node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
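A zero-caller claim like the one in this commit can be spot-checked with a small `ast`-based script. The sketch below is illustrative only: `TOY_GCP` and `TOY_CALLER` are hypothetical stand-ins rather than the real contents of `gcp.py` or its importers, and the check is name-based, so a helper that is only referenced dynamically (for example through `getattr`) would be reported as unreferenced.

```python
import ast


def defined_functions(source: str) -> set[str]:
    """Names of the top-level functions defined in a module's source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def referenced_names(source: str) -> set[str]:
    """Every identifier a module reads, calls, imports by name, or accesses as an attribute."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
        elif isinstance(node, ast.ImportFrom):
            names.update(alias.name for alias in node.names)
    return names


def unreferenced(module_source: str, other_sources: list[str]) -> set[str]:
    """Top-level functions of a module that nothing (itself included) refers to by name."""
    used = referenced_names(module_source)
    for source in other_sources:
        used |= referenced_names(source)
    return defined_functions(module_source) - used


# Hypothetical toy inputs: the function names mirror the PR, the bodies are made up.
TOY_GCP = '''
def run_gcloud_command(args):
    return list(args)

def list_instances():
    return run_gcloud_command(["list"])

def get_default_zone():
    return "some-zone"

def find_vm_by_ip(ip):
    return run_gcloud_command(["find", ip])
'''

TOY_CALLER = '''
from gcp import find_vm_by_ip

find_vm_by_ip("10.0.0.1")
'''
```

On the toy inputs, `unreferenced(TOY_GCP, [TOY_CALLER])` reports the two helpers that have neither an internal nor an external caller. A real check would read the module and every other Python file in the tree from disk instead of using inline strings.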
yonromai added a commit that referenced this pull request on Apr 23, 2026
## Summary

Stage 3f of the Ray removal (umbrella #4453). Deletes the legacy Ray-backed `fray.v1` execution layer in its entirety.

`fray.v1` is orphaned:

- Earlier code stages (#5131, #5132) removed every external consumer of `fray.v1.*` across `lib/marin`, `lib/levanter`, `experiments/`, `tests/`, and docs.
- Stage 7 GCP teardown on 2026-04-23 destroyed the 9 non-`marin-big-run` Ray head VMs and 34 firewall rules, so no live infrastructure targets the v1 code path.

## What this PR deletes

- `lib/fray/src/fray/v1/**`: cluster, cluster/ray/*, job/context, isolated_env, queue, fn_thunk, cli (23 files, ~6.7k LOC)
- `lib/fray/tests/{conftest,test_cluster,test_queue,test_isolated_env,test_job_context,test_device_flops}.py`: v1-only tests
- `[project.scripts] fray = "fray.v1.cli:main"` entrypoint in `lib/fray/pyproject.toml`
- 10 v1 entries in `.pyrefly-baseline.json`
- Residual v1 references in `lib/fray/src/fray/__init__.py` docstring, `lib/fray/src/fray/cluster/__init__.py` docstring, and `lib/fray/AGENTS.md`

## What stays

- `fray.v2` (production API): untouched
- `fray.cluster`: still a v2 re-export shim; wide external use via `from fray.cluster import ResourceConfig`
- `ray==2.54.0` optional dep + `ray[default]` in `fray_test` group: kept for `fray.v2.ray_backend`, deferred to stage 3g

## Verification

- `./infra/pre-commit.py --all-files --fix` → OK
- `uv run pyrefly check`: 150 errors (pre-commit filters via baseline); origin/main reports 163; the 13-error drop matches v1 code we removed, no new unsuppressed errors introduced
- `uv run pytest lib/fray/tests -x --timeout=60` → 60 passed (all remaining v2 tests)

## Test plan

- [x] pre-commit.py all-files passes
- [x] pyrefly baseline stays clean
- [x] fray v2 test suite passes
- [ ] CI green before merge

## Next steps

After this merges, the remaining roadmap for #4453:

1. **Stage 3g**: drop the Ray backend from `fray.v2` (remove `fray.v2.ray_backend`, delete `ray==2.54.0` dep + `ray[default]` in `fray_test` group).
2. **Stage 3i**: rename `fray.v2.*` → `fray.*` once v2 is the only backend.
3. **GCP §2 + §3** (parked on `marin-big-run` retirement): delete 6 `RAY_*` secrets + 163 `marin_cluster*` artifact-registry digests across 6 regions. See audit log on #4453.
4. **Close #4453** once 3g, 3i, §2, and §3 are done.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
Summary
Follow-up cleanup to the Ray-removal effort (parent #4453). Flagged by #5089's agent during operator-tooling review.
- Delete `lib/marin/src/marin/cluster/config.py` entirely. Its only live caller was `ray_run.py`, removed in #5087 (Delete ray_run.py).
- Trim `lib/marin/src/marin/cluster/gcp.py`:
  - Remove the Ray-only helpers: `terminate_tpus_in_cluster`, `terminate_head_node`, `delete_tpu_node`.
  - Keep the generic gcloud helpers (`run_gcloud_command`, `get_project_id`, `get_default_zone`, `list_instances`, `list_tpu_nodes`, `find_tpu_by_ip`, `find_vm_by_ip`, `ssh_to_vm`, `ssh_to_tpu`), which are used by `scripts/iris/dev_tpu.py` (backs the `dev-tpu` skill) and `scripts/gcp-ssh`.
- Clean stale `RayClusterConfig`/`update_cluster_configs` exports from `lib/marin/src/marin/cluster/__init__.py`.

Total: 535 LOC removed, no new code.
Audit
Performed before deletion on `origin/main` (HEAD `e10e14055`).

| Name | Where it appears |
|---|---|
| `marin.cluster.config` module | searched with `marin\.cluster\.config\|from marin\.cluster import config\|cluster\.config\.` |
| `RayClusterConfig` | `cluster/__init__.py` and `cluster/config.py` |
| `update_cluster_configs` | `cluster/__init__.py` and `cluster/config.py` |
| `terminate_tpus_in_cluster` | `gcp.py` itself |
| `terminate_head_node` | `gcp.py` itself |
| `delete_tpu_node` | `terminate_tpus_in_cluster` inside `gcp.py` (line 229) |
| `from marin.cluster import gcp` | `scripts/iris/dev_tpu.py`, `scripts/gcp-ssh` |

Post-deletion grep: zero hits for all deleted names in the tree, and the two `marin.cluster.gcp` importers are intact.

Notes on unused-but-kept helpers
These generic helpers have zero external callers right now. They are retained per review instructions — the user will decide whether to prune them:
- `get_default_zone`: not called anywhere in marin (levanter has its own copy at `lib/levanter/src/levanter/infra/cli_helpers.py:78`).
- `list_instances`: its only internal caller was `terminate_head_node` (now deleted). No external callers.

All other "keep" helpers have live callers either externally (scripts) or internally (`run_gcloud_command`, `list_tpu_nodes` used by `find_tpu_by_ip`).

Test plan
- `rg 'marin\.cluster\.config|from marin\.cluster import config|cluster\.config\.'` -> 0 hits
- `rg 'terminate_tpus_in_cluster|terminate_head_node|delete_tpu_node'` -> 0 hits
- `rg 'from marin\.cluster import gcp|marin\.cluster\.gcp'` -> 2 hits, both in `scripts/`
- `rg 'RayClusterConfig|update_cluster_configs'` -> 0 hits
- `uv run scripts/iris/dev_tpu.py --help` runs cleanly (lists allocate/connect/execute/release/setup_env/status/watch)
- `uv run python -c 'import marin.cluster; import marin.cluster.gcp'` -> imports ok
- `./infra/pre-commit.py --all-files --fix` passes (ruff, black, license headers, pyrefly, AST, merges, TOML/YAML, trailing whitespace, EOF newlines, notebooks, markdown)
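
The grep checks in the test plan could be bundled into one repeatable script. The following is a minimal sketch, not part of this PR: `find_hits` and `audit` are hypothetical helper names, Python's `re` stands in for ripgrep, and a "hit" is assumed to mean one matching line. The patterns and expected counts are taken from the list above.

```python
import re
from pathlib import Path

# Pattern -> expected number of matching lines, as reported in the test plan.
EXPECTED_HITS = {
    r"marin\.cluster\.config|from marin\.cluster import config|cluster\.config\.": 0,
    r"terminate_tpus_in_cluster|terminate_head_node|delete_tpu_node": 0,
    r"from marin\.cluster import gcp|marin\.cluster\.gcp": 2,
    r"RayClusterConfig|update_cluster_configs": 0,
}


def find_hits(root, pattern):
    """Return (path, line number, stripped line) for each line under root matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything inside a .git directory.
        if not path.is_file() or ".git" in path.parts:
            continue
        # errors="ignore" lets the scan pass over binary or oddly encoded files.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


def audit(root, expectations):
    """Return one message per pattern whose hit count differs from the expectation."""
    failures = []
    for pattern, expected in expectations.items():
        found = len(find_hits(root, pattern))
        if found != expected:
            failures.append(f"{pattern!r}: expected {expected} hits, found {found}")
    return failures
```

Calling `audit(".", EXPECTED_HITS)` from the repository root returns an empty list when every count matches. Unlike ripgrep, this sketch does not honor `.gitignore`, so build artifacts or vendored copies could add hits that `rg` would skip.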