Delete Ray operator tooling #5089

Merged
rjpower merged 1 commit into main from 20260422-stage5-operator-tooling on Apr 22, 2026

@yonromai
Contributor

Stage 5 of the Ray retirement plan (#4453). Deletes the Ray operator scripts, the Ray cluster Dockerfile, and the GitHub Actions / Makefile plumbing that built and tested them. Colleague sign-off received.

Summary

Files deleted

  • scripts/ray/cluster.py (~1040 LOC) — Ray cluster lifecycle wrapper.
  • scripts/ray/dev_tpu.py — superseded by scripts/iris/dev_tpu.py.
  • scripts/ray/cleanup_workers.py — Ray-specific preempted-TPU cleanup.
  • scripts/ray/README.md.
  • scripts/debug/inspect_data.py — Ray-only data-inspection CLI.
  • docker/marin/Dockerfile.cluster — based on rayproject/ray:2.53.0-py311-cpu, audit confirmed Ray-only.
  • .github/workflows/marin-cleanup-tpus.yaml — invoked scripts/ray/cleanup_workers.py on cron.
  • The marin-cluster-images job in .github/workflows/docker-images.yaml (build, smoke test, push, and the auto-PR that regenerated Ray cluster configs). Other jobs in the file (Iris, TPU CI, Levanter) are untouched.
  • cluster_docker* / cluster_tag Makefile targets (and their .PHONY entries). These only fed Dockerfile.cluster.

Helpers moved (not duplicated)

scripts/debug/inspect_data.py defined two small helpers (_normalize_cluster_region and _validate_data_region) that were only imported by tests/test_inspect_data_region.py. Rather than finding them a new home, this PR moves them into that test file, their sole importer. No behavior change.

  • Before: from scripts.debug.inspect_data import _normalize_cluster_region, _validate_data_region
  • After: helpers defined at the top of tests/test_inspect_data_region.py.

Audit findings — lib/marin/src/marin/cluster/{config,gcp}.py NOT deleted

The plan's stage 5 table said these modules "become dead once scripts/ray/cluster.py is gone" and asked for a re-grep first. The re-grep turned up live callers outside Ray operator tooling, so per the plan's explicit "stop and report" directive I am leaving both files in place this stage.

$ rg 'marin\.cluster\.(config|gcp)|from marin\.cluster import' lib/ experiments/ scripts/ tests/
scripts/debug/inspect_data.py              — DELETED this PR
scripts/ray/cluster.py                     — DELETED this PR
scripts/ray/dev_tpu.py                     — DELETED this PR
scripts/gcp-ssh                            — LIVE, uses marin.cluster.gcp (get_project_id, find_tpu_by_ip, find_vm_by_ip, ssh_to_vm, ssh_to_tpu)
scripts/iris/dev_tpu.py                    — LIVE, uses marin.cluster.gcp (get_project_id, find_tpu_by_ip, find_vm_by_ip)
lib/marin/src/marin/run/ray_run.py         — LIVE, uses marin.cluster.config.find_config_by_region; queued for deletion in stage 3h

Concretely:

  • lib/marin/src/marin/cluster/config.py — still imported by lib/marin/src/marin/run/ray_run.py, which is the stage 3h target. Once 3h lands this file has no callers and can go.
  • lib/marin/src/marin/cluster/gcp.py — mixed module. Ray-specific functions (terminate_tpus_in_cluster, terminate_head_node, both label-filtered on ray-node-*) are dead after this PR, but the generic gcloud wrappers (get_project_id, find_tpu_by_ip, find_vm_by_ip, ssh_to_vm, ssh_to_tpu, list_instances, list_tpu_nodes, delete_tpu_node, run_gcloud_command) have active non-Ray consumers in scripts/gcp-ssh and scripts/iris/dev_tpu.py. A follow-up should split the Ray-specific helpers out and keep the rest.

I suggest a follow-up after 3h lands: delete config.py and split gcp.py into Ray-only (deleted) and generic (kept, possibly relocated out of the cluster/ package).
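
The re-grep above can be approximated in Python, which may be handy for repeating the audit before the follow-up. This is a minimal sketch, not tooling from this PR: `find_cluster_importers` and the demo file contents are hypothetical, and only the regex is taken from the rg command. Unlike rg, it does not honor .gitignore.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as the rg invocation in the audit above.
PATTERN = re.compile(r"marin\.cluster\.(config|gcp)|from marin\.cluster import")


def find_cluster_importers(root: Path, subdirs: list[str]) -> dict[str, list[int]]:
    """Map each matching file (relative to root) to its 1-based matching line numbers."""
    hits: dict[str, list[int]] = {}
    for subdir in subdirs:
        base = root / subdir
        if not base.is_dir():
            continue  # e.g. no tests/ directory in this tree
        for path in sorted(base.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            lines = [i for i, line in enumerate(text.splitlines(), 1) if PATTERN.search(line)]
            if lines:
                hits[path.relative_to(root).as_posix()] = lines
    return hits


def demo() -> dict[str, list[int]]:
    """Build a throwaway tree (hypothetical contents) and audit it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scripts").mkdir()
        (root / "lib").mkdir()
        (root / "scripts" / "gcp-ssh").write_text("import os\nfrom marin.cluster import gcp\n")
        (root / "lib" / "other.py").write_text("import json\n")
        return find_cluster_importers(root, ["lib", "scripts", "tests"])


if __name__ == "__main__":
    for name, lines in demo().items():
        print(name, lines)
```

Classifying each hit as DELETED or LIVE, as in the listing above, is still a manual step in this sketch.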

Test plan

  • rg 'import ray' lib/ scripts/ — only lib/fray/** hits remain:
    lib/fray/src/fray/v1/cluster/ray/cluster.py
    lib/fray/src/fray/v1/cluster/ray/resources.py
    lib/fray/src/fray/v1/cluster/ray/tpu/execution.py
    lib/fray/src/fray/v1/job/context.py          (guarded)
    lib/fray/src/fray/v2/client.py               (guarded, line 192)
    lib/fray/src/fray/v2/ray_backend/backend.py
    lib/fray/src/fray/v2/ray_backend/context.py
    lib/fray/src/fray/v2/ray_backend/dashboard.py
    lib/fray/src/fray/v2/ray_backend/resources.py
    lib/fray/src/fray/v2/ray_backend/tpu.py
    
  • uv sync — succeeds. The root package has no tpu extra; uv sync --package marin --extra tpu resolves 597 packages on macOS but cannot install libtpu wheels (no arm64 macOS wheels exist; unrelated to this change).
  • ./infra/pre-commit.py --all-files --fix — passes end-to-end: Ruff, Black, license headers, Pyrefly, YAML/TOML, trailing whitespace, EOF newline, Jupyter, large files, Python AST, merge conflicts, markdown.
  • Migrated helpers still tested: uv run --with pytest pytest tests/test_inspect_data_region.py -v reports 7 passed in 0.27s.
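
Two of the remaining hits are marked "(guarded)". Assuming that means the import is wrapped so the module still loads when Ray is not installed, the pattern looks roughly like the sketch below. The helper names `optional_import` and `require` are hypothetical and are not taken from lib/fray.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Evaluated at import time; the surrounding module loads either way.
ray = optional_import("ray")


def require(module: Optional[ModuleType], name: str) -> ModuleType:
    """Fail at call time, with a clear message, if an optional dependency is missing."""
    if module is None:
        raise RuntimeError(f"{name} is required for this code path but is not installed")
    return module
```

A Ray-only code path would then call `require(ray, "ray")` before touching the module, so callers that never reach that path are unaffected by Ray's absence.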

Parent issue: #4453.

Stage 5 of the Ray retirement plan (#4453). Ray operator scripts in
scripts/ray/ are superseded by scripts/iris/; the cluster Dockerfile
and its CI flow are no longer needed now that Iris handles cluster
lifecycle.

Deletes
- scripts/ray/{cluster,dev_tpu,cleanup_workers}.py and scripts/ray/README.md
- scripts/debug/inspect_data.py (Ray-only data-inspection CLI)
- docker/marin/Dockerfile.cluster (based on rayproject/ray)
- .github/workflows/marin-cleanup-tpus.yaml (invoked scripts/ray/cleanup_workers.py)
- Marin cluster image job in .github/workflows/docker-images.yaml
- Makefile cluster_docker* targets (built the above Dockerfile)

Helpers moved
- _normalize_cluster_region and _validate_data_region lifted from the
  deleted scripts/debug/inspect_data.py into their sole importer,
  tests/test_inspect_data_region.py.

Not deleted this stage
- lib/marin/src/marin/cluster/config.py still imported by
  lib/marin/src/marin/run/ray_run.py, which is queued for stage 3h.
- lib/marin/src/marin/cluster/gcp.py still imported by scripts/gcp-ssh
  and scripts/iris/dev_tpu.py for the non-Ray gcloud helpers
  (get_project_id, find_tpu_by_ip, find_vm_by_ip, ssh_to_vm, ssh_to_tpu).
  Both files need a follow-up after stage 3h to split Ray-specific
  helpers from the generic GCP utilities.

Verification
- rg 'import ray' lib/ scripts/ returns only lib/fray/** hits.
- uv sync (default) succeeds; uv sync --package marin --extra tpu
  resolves 597 packages (libtpu wheel platform mismatch on macOS is
  unrelated to this change).
- ./infra/pre-commit.py --all-files --fix passes.
- Migrated test_inspect_data_region.py: 7 passed in 0.27s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yonromai added the agent-generated label (Created by automation/agent) on Apr 22, 2026
yonromai marked this pull request as ready for review on April 22, 2026 22:52
yonromai requested review from dlwh and rjpower on April 22, 2026 22:52

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7362d660f6

Comment thread Makefile
@@ -1,4 +1,4 @@
-.PHONY: help clean check fix cluster_docker cluster_docker_build cluster_docker_push setup_pre_commit rust-dev rust-user rust-status rust-package
+.PHONY: help clean check fix setup_pre_commit rust-dev rust-user rust-status rust-package

P1: Update Ray build docs when removing cluster_docker targets

Removing the cluster_docker*/cluster_tag Make targets immediately breaks the documented operator flow: docs/dev-guide/rebuilding-cluster.md (lines 60-70) and infra/README.md (lines 224-247) still instruct users to run those targets and scripts/ray/cluster.py, which this commit also deletes. In practice this now yields make: *** No rule to make target ... and dead commands in runbooks, so the docs are no longer executable after this change. This also violates the root AGENTS.md rule to keep MkDocs docs in sync with code.
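
One way to catch this class of drift is to compare `make <target>` commands in the docs against the targets the Makefile still defines. The sketch below is an assumption about how such a check could look. No such script exists in this PR, the names `defined_targets` and `stale_make_references` are hypothetical, and the sample Makefile recipes and doc text are invented. The parsing is deliberately simple: no includes, pattern rules, or double-colon rules.

```python
import re

# Only lines that look like a shell command: optional "$ " prompt, then "make <target>".
MAKE_CMD_RE = re.compile(r"^\s*(?:\$\s+)?make\s+([A-Za-z_][A-Za-z0-9_-]*)")
TARGET_NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def defined_targets(makefile_text: str) -> set[str]:
    """Collect explicit rule targets from Makefile text."""
    targets: set[str] = set()
    for line in makefile_text.splitlines():
        if line.startswith(("\t", "#", ".")):
            continue  # recipe lines, comments, special targets such as .PHONY
        head, sep, rest = line.partition(":")
        if not sep or rest.startswith(("=", ":")):
            continue  # not a rule, or a ":=" / "::=" variable assignment
        names = head.split()
        if names and all(TARGET_NAME_RE.fullmatch(n) for n in names):
            targets.update(names)
    return targets


def stale_make_references(doc_text: str, targets: set[str]) -> list[tuple[int, str]]:
    """Return (line number, target) for each documented make command with no matching rule."""
    stale: list[tuple[int, str]] = []
    for lineno, line in enumerate(doc_text.splitlines(), 1):
        match = MAKE_CMD_RE.match(line)
        if match and match.group(1) not in targets:
            stale.append((lineno, match.group(1)))
    return stale


# Invented sample inputs; only the target name cluster_docker comes from the diff above.
MAKEFILE = (
    ".PHONY: help clean check fix\n"
    "help:\n"
    "\t@echo usage\n"
    "clean:\n"
    "\trm -rf build\n"
    "check fix: clean\n"
)
DOC = "Rebuild the image:\n\n    make cluster_docker\n    make clean\n"

if __name__ == "__main__":
    print(stale_make_references(DOC, defined_targets(MAKEFILE)))
```

Restricting the match to command-like lines avoids flagging prose such as "make sure", at the cost of missing commands embedded mid-sentence.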

rjpower merged commit b985908 into main on Apr 22, 2026
44 checks passed
rjpower deleted the 20260422-stage5-operator-tooling branch on April 22, 2026 22:59
yonromai added a commit that referenced this pull request Apr 23, 2026
## Summary

Follow-up cleanup to the Ray-removal effort (parent #4453). Flagged by
#5089's agent during operator-tooling review.

- Delete `lib/marin/src/marin/cluster/config.py` entirely. Its only live
caller was `ray_run.py`, removed in #5087.
- Trim Ray-only helpers from `lib/marin/src/marin/cluster/gcp.py`:
`terminate_tpus_in_cluster`, `terminate_head_node`, `delete_tpu_node`.
- Keep the generic gcloud helpers (`run_gcloud_command`,
`get_project_id`, `get_default_zone`, `list_instances`,
`list_tpu_nodes`, `find_tpu_by_ip`, `find_vm_by_ip`, `ssh_to_vm`,
`ssh_to_tpu`) — used by `scripts/iris/dev_tpu.py` (backs the `dev-tpu`
skill) and `scripts/gcp-ssh`.
- Clean stale `RayClusterConfig` / `update_cluster_configs` exports from
`lib/marin/src/marin/cluster/__init__.py`.

Total: 535 LOC removed, no new code.

## Audit

Performed before deletion on `origin/main` (HEAD `e10e14055`).

| Symbol | External callers | Action |
|---|---|---|
| `marin.cluster.config` module | 0 (grep `marin\.cluster\.config\|from marin\.cluster import config\|cluster\.config\.`) | delete |
| `RayClusterConfig` | 0 outside `cluster/__init__.py` and `cluster/config.py` | stale export, remove |
| `update_cluster_configs` | 0 outside `cluster/__init__.py` and `cluster/config.py` | stale export, remove |
| `terminate_tpus_in_cluster` | 0 outside `gcp.py` itself | delete |
| `terminate_head_node` | 0 outside `gcp.py` itself | delete |
| `delete_tpu_node` | 1, and it's `terminate_tpus_in_cluster` inside `gcp.py` (line 229) | delete |
| `from marin.cluster import gcp` | 2: `scripts/iris/dev_tpu.py`, `scripts/gcp-ssh` | keep module |

Post-deletion grep: zero hits for all deleted names in the tree, and the
two `marin.cluster.gcp` importers are intact.

### Notes on unused-but-kept helpers

These generic helpers have zero external callers right now. They are
retained per review instructions — the user will decide whether to prune
them:

- `get_default_zone`: not called anywhere in marin (levanter has its own
copy at `lib/levanter/src/levanter/infra/cli_helpers.py:78`).
- `list_instances`: its only internal caller was `terminate_head_node`
(now deleted). No external callers.

All other "keep" helpers have live callers either externally (scripts)
or internally (`run_gcloud_command`, `list_tpu_nodes` used by
`find_tpu_by_ip`).
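
The reasoning in this section (a helper is dead once every caller is gone) is a reachability question over the module's internal call graph. A minimal sketch using the standard `ast` module follows. The function bodies in `SOURCE` are stubs invented for illustration; only the function names and the caller relationships stated in this description are taken from the source, and external callers are modeled as an explicit root set.

```python
import ast

# Stub module: names and call edges follow the audit notes, bodies are placeholders.
SOURCE = """
def list_tpu_nodes(): ...
def list_instances(): ...
def delete_tpu_node(): ...
def find_tpu_by_ip(): list_tpu_nodes()
def terminate_tpus_in_cluster(): delete_tpu_node()
def terminate_head_node(): list_instances()
"""

# Functions called from outside the module (here: a subset of what the scripts use).
EXTERNAL_ROOTS = {"find_tpu_by_ip"}


def call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the module-level functions it calls by bare name."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {}
    for name, node in funcs.items():
        graph[name] = {
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id in funcs
        }
    return graph


def unreachable(graph: dict[str, set[str]], roots: set[str]) -> set[str]:
    """Functions not reachable from any root, i.e. candidates for deletion."""
    seen: set[str] = set()
    stack = [r for r in roots if r in graph]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(graph[name] - seen)
    return set(graph) - seen


if __name__ == "__main__":
    print(sorted(unreachable(call_graph(SOURCE), EXTERNAL_ROOTS)))
```

In this stub, `list_instances` comes out unreachable alongside the three deleted functions, which matches the note above that it was kept despite having no remaining callers. The sketch only sees calls by bare name, so attribute calls and dynamic dispatch would need separate handling.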

## Test plan

- [x] `rg 'marin\.cluster\.config|from marin\.cluster import config|cluster\.config\.'` -> 0 hits
- [x] `rg 'terminate_tpus_in_cluster|terminate_head_node|delete_tpu_node'` -> 0 hits
- [x] `rg 'from marin\.cluster import gcp|marin\.cluster\.gcp'` -> 2 hits, both in `scripts/`
- [x] `rg 'RayClusterConfig|update_cluster_configs'` -> 0 hits
- [x] `uv run scripts/iris/dev_tpu.py --help` runs cleanly (lists
allocate/connect/execute/release/setup_env/status/watch)
- [x] `uv run python -c 'import marin.cluster; import
marin.cluster.gcp'` -> imports ok
- [x] `./infra/pre-commit.py --all-files --fix` passes (ruff, black,
license headers, pyrefly, AST, merges, TOML/YAML, trailing whitespace,
EOF newlines, notebooks, markdown)
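
The import check in the test plan can also be written as a small script, which may be convenient if the list of modules grows. This is a sketch only: `smoke_import` is a hypothetical helper, not part of the repository.

```python
import importlib


def smoke_import(module_names: list[str]) -> dict[str, str]:
    """Try to import each module; return {name: error message} for the failures."""
    failures: dict[str, str] = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # ImportError, plus anything raised at import time
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


if __name__ == "__main__":
    # Standard-library names so the demo runs without marin installed.
    print(smoke_import(["json", "os.path"]))
```

In a checkout with marin installed, `smoke_import(["marin.cluster", "marin.cluster.gcp"])` would be expected to return an empty dict if the one-liner above passes.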

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>