infra: delete Ray cluster templates and retire Ray-cluster docs #5132
Part of the Ray retirement umbrella (#4453): stage 6 of ray_removal_analysis.md. Deletes all 16 live cluster configs plus the two generator templates (marin-cluster-template.yaml, marin-vllm-template.yaml), and strips the "Our Cluster" and "Maintaining a Ray Cluster" sections from infra/README.md. The Artifact Registry cleanup section is preserved because Iris clusters also use that registry. The only remaining in-tree references to these YAML paths are in code scheduled for deletion in stage 5 (lib/marin/src/marin/cluster/config.py) and stage 3f (lib/fray/src/fray/v1/cluster/ray/config.py), plus historical logbooks under .agents/projects/ that the plan explicitly defers. infra/marin-tmux.sh still references these paths and is also covered by stage 5 / stage 7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone Ray-cluster tmux helper with zero in-repo references. Dead weight now that the cluster configs and wrapper are gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fbeeb1e3f2
@@ -1,253 +0,0 @@
#####################################################
Keep region alias cluster configs available
Removing all infra/marin-*.yaml files makes both fray.v1.cluster.ray.config.find_config_by_region and marin.cluster.config.find_config_by_region fail for every region alias, because those helpers only resolve to files under this naming pattern and otherwise raise FileNotFoundError. Any existing Ray entrypoint using region-based cluster selection (for example Fray specs that pass cluster=<region>) now hard-fails at startup instead of connecting.
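The failure mode described in this comment can be sketched as follows. This is a minimal illustration, not the actual helper: `resolve_region_config` and the `marin-<region>.yaml` lookup are hypothetical stand-ins that assume the naming pattern mentioned above.

```python
from pathlib import Path


def resolve_region_config(region: str, infra_dir: Path) -> Path:
    """Hypothetical stand-in for a region-alias lookup; not the real helper."""
    # Assumed naming pattern: <infra_dir>/marin-<region>.yaml
    candidate = infra_dir / f"marin-{region}.yaml"
    if not candidate.is_file():
        # Once the YAML files are deleted, every alias ends up here.
        raise FileNotFoundError(
            f"no cluster config for region {region!r}: {candidate}"
        )
    return candidate
```

Under these assumptions, any caller that passes a region alias fails at lookup time once the matching YAML file is gone, which is the hard failure the comment warns about.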
@@ -1,130 +0,0 @@
# Unique Identifier for the Head Node + Workers
Keep template files until generator APIs are retired
Deleting marin-cluster-template.yaml (and the vLLM template in the same change) breaks marin.cluster.config.update_cluster_configs(): it calls get_template_path() for each entry in CONFIGS, which now raises FileNotFoundError immediately because the expected template no longer exists. This turns config regeneration from a working maintenance path into a runtime error unless the generator API/callers are removed in the same commit.
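One way to make this breakage explicit until the generator is removed is a preflight check that reports every missing template in a single error. The sketch below is an assumption-laden illustration: `missing_templates` and `regenerate_configs` are hypothetical names, and the real `update_cluster_configs()` and `CONFIGS` structure may differ.

```python
from pathlib import Path


def missing_templates(template_names: list[str], infra_dir: Path) -> list[str]:
    """Return the template file names that are absent from infra_dir."""
    return [name for name in template_names if not (infra_dir / name).is_file()]


def regenerate_configs(template_names: list[str], infra_dir: Path) -> None:
    """Hypothetical generator entry point that fails fast with one clear error."""
    missing = missing_templates(template_names, infra_dir)
    if missing:
        raise RuntimeError("missing cluster templates: " + ", ".join(missing))
    # Rendering of the per-cluster configs would happen here; omitted in this sketch.
```

With both templates deleted, a call such as `regenerate_configs(["marin-cluster-template.yaml", "marin-vllm-template.yaml"], Path("infra"))` would raise one descriptive error instead of failing on the first template it touches.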
## Summary

Stage 3f of the Ray removal (umbrella #4453). Deletes the legacy Ray-backed `fray.v1` execution layer in its entirety.

`fray.v1` is orphaned:

- Earlier code stages (#5131, #5132) removed every external consumer of `fray.v1.*` across `lib/marin`, `lib/levanter`, `experiments/`, `tests/`, and docs.
- Stage 7 GCP teardown on 2026-04-23 destroyed the 9 non-`marin-big-run` Ray head VMs and 34 firewall rules, so no live infrastructure targets the v1 code path.

## What this PR deletes

- `lib/fray/src/fray/v1/**`: cluster, cluster/ray/*, job/context, isolated_env, queue, fn_thunk, cli (23 files, ~6.7k LOC)
- `lib/fray/tests/{conftest,test_cluster,test_queue,test_isolated_env,test_job_context,test_device_flops}.py`: v1-only tests
- `[project.scripts] fray = "fray.v1.cli:main"` entrypoint in `lib/fray/pyproject.toml`
- 10 v1 entries in `.pyrefly-baseline.json`
- Residual v1 references in `lib/fray/src/fray/__init__.py` docstring, `lib/fray/src/fray/cluster/__init__.py` docstring, and `lib/fray/AGENTS.md`

## What stays

- `fray.v2` (production API): untouched
- `fray.cluster`: still a v2 re-export shim; wide external use via `from fray.cluster import ResourceConfig`
- `ray==2.54.0` optional dep + `ray[default]` in `fray_test` group: kept for `fray.v2.ray_backend`, deferred to stage 3g

## Verification

- `./infra/pre-commit.py --all-files --fix` → OK
- `uv run pyrefly check`: 150 errors (pre-commit filters via baseline); origin/main reports 163; the 13-error drop matches v1 code we removed, no new unsuppressed errors introduced
- `uv run pytest lib/fray/tests -x --timeout=60` → 60 passed (all remaining v2 tests)

## Test plan

- [x] pre-commit.py all-files passes
- [x] pyrefly baseline stays clean
- [x] fray v2 test suite passes
- [ ] CI green before merge

## Next steps

After this merges, the remaining roadmap for #4453:

1. **Stage 3g**: drop the Ray backend from `fray.v2` (remove `fray.v2.ray_backend`, delete `ray==2.54.0` dep + `ray[default]` in `fray_test` group).
2. **Stage 3i**: rename `fray.v2.*` → `fray.*` once v2 is the only backend.
3. **GCP §2 + §3** (parked on `marin-big-run` retirement): delete 6 `RAY_*` secrets + 163 `marin_cluster*` artifact-registry digests across 6 regions. See audit log on #4453.
4. **Close #4453** once 3g, 3i, §2, and §3 are done.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
Part of the Ray retirement umbrella (#4453). Implements stage 6 of `ray_removal_analysis.md`: cluster templates.

## Summary

- Deleted all 16 live cluster configs (`infra/marin-{big-run,eu-west4,eu-west4-a,eu-west4-vllm,us-central1,us-central1-vllm,us-central2,us-central2-staging,us-central2-vllm,us-east1,us-east1-d-vllm,us-east5,us-east5-a,us-east5-a-vllm,us-east5-b-vllm,us-west4}.yaml`).
- Deleted the two generator templates (`marin-cluster-template.yaml`, `marin-vllm-template.yaml`).
- Stripped the `Our Cluster` and `Maintaining a Ray Cluster` sections from `infra/README.md`; kept the `Artifact Registry Cleanup Policy Management` section (Iris clusters use the same registry).

## Ordering / dependencies

This PR assumes the per-cluster `ray down` teardown (stage 7) has already been performed. Per the plan, the cluster retires this week or next; ordering stays code-first, cluster-last.

## Known residual references (out of scope)

`rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` still returns hits in:

- `lib/marin/src/marin/cluster/config.py`: scheduled for stage 5.
- `lib/fray/src/fray/v1/cluster/ray/config.py`: scheduled for stage 3f.
- `.agents/projects/linear_ce_loss.md`, `.agents/projects/vllm-docker.md`: historical logbooks the plan explicitly defers.
- `infra/marin-tmux.sh` (not in the verify path, but now dead): will go with stage 5 / stage 7.

These were identified during verification and match the staging called out in the plan.

## Test plan

- `./infra/pre-commit.py infra/README.md` passes.
- `rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/` returns only the residuals listed above (all scheduled for other stages).
- `infra/README.md` retains the Artifact Registry section and renders cleanly.
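The residual-reference check in the test plan could also be scripted so that any hit outside the known residuals is flagged automatically. The following is a rough Python approximation, not the verification actually run for this PR: it does a literal substring match rather than an `rg` regex search, it does not honor `.gitignore`, and `find_unexpected_references` is a hypothetical name.

```python
from pathlib import Path

PATTERN = "infra/marin-"  # literal substring, standing in for the rg pattern


def find_unexpected_references(
    root: Path, search_dirs: list[str], allowed: set[str]
) -> list[str]:
    """Return repo-relative paths that mention PATTERN and are not allow-listed."""
    unexpected = []
    for name in search_dirs:
        base = root / name
        if not base.is_dir():
            continue  # tolerate search directories that do not exist
        for path in sorted(base.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            rel = path.relative_to(root).as_posix()
            if PATTERN in text and rel not in allowed:
                unexpected.append(rel)
    return unexpected
```

Called with the five search directories and an allow-list containing the residual paths listed above, an empty result would correspond to the "returns only the residuals" condition in the test plan.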