
Remove dead Ray helpers from marin.cluster #5131

Merged
yonromai merged 2 commits into main from 20260423-cluster-cleanup on Apr 23, 2026
@yonromai
Contributor

Summary

Follow-up cleanup to the Ray-removal effort (parent #4453). Flagged by #5089's agent during operator-tooling review.

  • Delete lib/marin/src/marin/cluster/config.py entirely. Its only live caller was ray_run.py, removed in #5087 (Delete ray_run.py).
  • Trim Ray-only helpers from lib/marin/src/marin/cluster/gcp.py: terminate_tpus_in_cluster, terminate_head_node, delete_tpu_node.
  • Keep the generic gcloud helpers (run_gcloud_command, get_project_id, get_default_zone, list_instances, list_tpu_nodes, find_tpu_by_ip, find_vm_by_ip, ssh_to_vm, ssh_to_tpu) — used by scripts/iris/dev_tpu.py (backs the dev-tpu skill) and scripts/gcp-ssh.
  • Clean stale RayClusterConfig / update_cluster_configs exports from lib/marin/src/marin/cluster/__init__.py.

Total: 535 LOC removed, no new code.
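
A stale export of the kind cleaned up in `__init__.py` can be caught mechanically. The following is a minimal sketch, assuming the package declares its public names in `__all__` (the PR does not say whether `marin.cluster` does); `stale_exports` and the `demo_cluster` stand-in module are hypothetical names used only for illustration.

```python
import importlib
import sys
import types


def stale_exports(module_name: str) -> list[str]:
    """Return names listed in a module's __all__ that the module does not define."""
    module = importlib.import_module(module_name)
    exported = getattr(module, "__all__", [])
    return [name for name in exported if not hasattr(module, name)]


# Hypothetical stand-in package so the sketch runs without marin installed.
demo = types.ModuleType("demo_cluster")
demo.__all__ = ["run_gcloud_command", "RayClusterConfig"]
demo.run_gcloud_command = lambda *args: None  # defined, so not stale
sys.modules["demo_cluster"] = demo

print(stale_exports("demo_cluster"))  # ['RayClusterConfig']
```

A stale `from .config import ...` line would fail earlier, with an ImportError at import time, so this check only complements a plain import smoke test.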

Audit

Performed before deletion on origin/main (HEAD e10e14055).

| Symbol | External callers | Action |
| --- | --- | --- |
| marin.cluster.config module | 0 (grep `marin\.cluster\.config\|from marin\.cluster import config\|cluster\.config\.`) | delete |
| RayClusterConfig | 0 outside cluster/__init__.py and cluster/config.py | stale export, remove |
| update_cluster_configs | 0 outside cluster/__init__.py and cluster/config.py | stale export, remove |
| terminate_tpus_in_cluster | 0 outside gcp.py itself | delete |
| terminate_head_node | 0 outside gcp.py itself | delete |
| delete_tpu_node | 1: terminate_tpus_in_cluster inside gcp.py (line 229) | delete |
| from marin.cluster import gcp | 2: scripts/iris/dev_tpu.py, scripts/gcp-ssh | keep module |

Post-deletion grep: zero hits for all deleted names in the tree, and the two marin.cluster.gcp importers are intact.
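
The audit counts references to each symbol outside the files that define it. A rough Python equivalent of that grep pass is sketched below; `count_external_callers` and the temporary demo tree are hypothetical and are not part of the repository.

```python
import re
import tempfile
from pathlib import Path


def count_external_callers(root: Path, symbol: str, defining_files: set[str]) -> dict[str, int]:
    """Map each .py file outside `defining_files` to its whole-word hit count for `symbol`."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    hits: dict[str, int] = {}
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root).as_posix()
        if rel in defining_files:
            continue  # references inside the defining file are not external callers
        count = len(pattern.findall(path.read_text(encoding="utf-8")))
        if count:
            hits[rel] = count
    return hits


# Hypothetical miniature tree standing in for the real repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cluster").mkdir()
    (root / "cluster" / "gcp.py").write_text("def delete_tpu_node(): ...\ndef ssh_to_vm(): ...\n")
    (root / "dev_tpu.py").write_text("from cluster import gcp\ngcp.ssh_to_vm()\n")
    demo_deleted = count_external_callers(root, "delete_tpu_node", {"cluster/gcp.py"})
    demo_kept = count_external_callers(root, "ssh_to_vm", {"cluster/gcp.py"})

print(demo_deleted)  # {}
print(demo_kept)     # {'dev_tpu.py': 1}
```

The sketch only scans `*.py` files, so an extensionless script such as scripts/gcp-ssh would be missed; a tree-wide text search does not have that blind spot.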

Notes on unused-but-kept helpers

These generic helpers have zero external callers right now. They are retained per review instructions — the user will decide whether to prune them:

  • get_default_zone: not called anywhere in marin (levanter has its own copy at lib/levanter/src/levanter/infra/cli_helpers.py:78).
  • list_instances: its only internal caller was terminate_head_node (now deleted). No external callers.

All other "keep" helpers have live callers either externally (scripts) or internally (run_gcloud_command, list_tpu_nodes used by find_tpu_by_ip).
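
Internal callers, such as list_tpu_nodes being used by find_tpu_by_ip, can be read off the module's syntax tree instead of grepping. This is a minimal sketch using the standard `ast` module; `internal_callers` is a hypothetical helper, and the function bodies in `demo_source` are invented placeholders, not the real gcp.py code.

```python
import ast


def internal_callers(source: str) -> dict[str, set[str]]:
    """For each top-level function, collect the other top-level functions calling it by bare name."""
    tree = ast.parse(source)
    functions = [n for n in tree.body if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    names = {fn.name for fn in functions}
    callers: dict[str, set[str]] = {name: set() for name in names}
    for fn in functions:
        for node in ast.walk(fn):
            # Only direct calls like `helper(...)` are tracked, not `module.helper(...)`.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                if callee in names and callee != fn.name:
                    callers[callee].add(fn.name)
    return callers


# Placeholder module source; only the call relationships mirror the PR text.
demo_source = '''
def run_gcloud_command(args): ...
def list_tpu_nodes(): return run_gcloud_command(["tpus", "list"])
def find_tpu_by_ip(ip): return [n for n in list_tpu_nodes() if n == ip]
def get_default_zone(): return "zone"
'''

graph = internal_callers(demo_source)
unused = sorted(name for name, who in graph.items() if not who)
print(unused)  # ['find_tpu_by_ip', 'get_default_zone']
```

A function with no internal caller is only a candidate for removal: find_tpu_by_ip appears in the list here, yet the PR keeps it among the helpers used by the scripts, so the external search still decides.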

Test plan

  • rg 'marin\.cluster\.config|from marin\.cluster import config|cluster\.config\.' -> 0 hits
  • rg 'terminate_tpus_in_cluster|terminate_head_node|delete_tpu_node' -> 0 hits
  • rg 'from marin\.cluster import gcp|marin\.cluster\.gcp' -> 2 hits, both in scripts/
  • rg 'RayClusterConfig|update_cluster_configs' -> 0 hits
  • uv run scripts/iris/dev_tpu.py --help runs cleanly (lists allocate/connect/execute/release/setup_env/status/watch)
  • uv run python -c 'import marin.cluster; import marin.cluster.gcp' -> imports ok
  • ./infra/pre-commit.py --all-files --fix passes (ruff, black, license headers, pyrefly, AST, merges, TOML/YAML, trailing whitespace, EOF newlines, notebooks, markdown)

Follow-up to #5087 (ray_run.py removal) and #5089 (operator tooling).
Both landed, leaving residue in lib/marin/src/marin/cluster/:

- config.py: only live caller was ray_run.py; delete entirely.
- gcp.py: trim Ray-only helpers (terminate_tpus_in_cluster,
  terminate_head_node, delete_tpu_node). Keep the generic gcloud
  helpers used by scripts/iris/dev_tpu.py and scripts/gcp-ssh.

Also clean stale RayClusterConfig/update_cluster_configs exports
from lib/marin/src/marin/cluster/__init__.py.

Refs #4453.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@yonromai added the agent-generated label (Created by automation/agent) on Apr 23, 2026

Both had zero live callers after the Ray helper removal: get_default_zone
was never referenced in-repo (levanter has its own copy at
cli_helpers.py:78), and list_instances was used only by the already-deleted
terminate_head_node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@yonromai marked this pull request as ready for review on April 23, 2026, 18:41
@yonromai merged commit 003de39 into main on Apr 23, 2026
38 checks passed
@yonromai deleted the 20260423-cluster-cleanup branch on April 23, 2026, 20:46
@yonromai added a commit that referenced this pull request on Apr 23, 2026

## Summary

Stage 3f of the Ray removal (umbrella #4453). Deletes the legacy
Ray-backed `fray.v1` execution layer in its entirety. `fray.v1` is
orphaned:

- Earlier code stages (#5131, #5132) removed every external consumer of
`fray.v1.*` across `lib/marin`, `lib/levanter`, `experiments/`,
`tests/`, and docs.
- Stage 7 GCP teardown on 2026-04-23 destroyed the 9 non-`marin-big-run`
Ray head VMs and 34 firewall rules, so no live infrastructure targets
the v1 code path.

## What this PR deletes

- `lib/fray/src/fray/v1/**` — cluster, cluster/ray/*, job/context,
isolated_env, queue, fn_thunk, cli (23 files, ~6.7k LOC)
- `lib/fray/tests/{conftest,test_cluster,test_queue,test_isolated_env,test_job_context,test_device_flops}.py` (v1-only tests)
- `[project.scripts] fray = "fray.v1.cli:main"` entrypoint in
`lib/fray/pyproject.toml`
- 10 v1 entries in `.pyrefly-baseline.json`
- Residual v1 references in `lib/fray/src/fray/__init__.py` docstring,
`lib/fray/src/fray/cluster/__init__.py` docstring, and
`lib/fray/AGENTS.md`

## What stays

- `fray.v2` (production API) — untouched
- `fray.cluster` — still a v2 re-export shim; wide external use via
`from fray.cluster import ResourceConfig`
- `ray==2.54.0` optional dep + `ray[default]` in `fray_test` group —
kept for `fray.v2.ray_backend`, deferred to stage 3g

## Verification

- `./infra/pre-commit.py --all-files --fix` → OK
- `uv run pyrefly check`: 150 errors (pre-commit filters via baseline);
origin/main reports 163; the 13-error drop matches v1 code we removed,
no new unsuppressed errors introduced
- `uv run pytest lib/fray/tests -x --timeout=60` → 60 passed (all
remaining v2 tests)

## Test plan

- [x] pre-commit.py all-files passes
- [x] pyrefly baseline stays clean
- [x] fray v2 test suite passes
- [ ] CI green before merge

## Next steps

After this merges, the remaining roadmap for #4453:

1. **Stage 3g** — drop the Ray backend from `fray.v2` (remove
`fray.v2.ray_backend`, delete `ray==2.54.0` dep + `ray[default]` in
`fray_test` group).
2. **Stage 3i** — rename `fray.v2.*` → `fray.*` once v2 is the only
backend.
3. **GCP §2 + §3** (parked on `marin-big-run` retirement): delete 6
`RAY_*` secrets + 163 `marin_cluster*` artifact-registry digests across
6 regions. See audit log on #4453.
4. **Close #4453** once 3g, 3i, §2, and §3 are done.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>