infra: delete Ray cluster templates and retire Ray-cluster docs#5132

Merged
yonromai merged 2 commits into main from remove-ray-cluster-yamls on Apr 23, 2026
@yonromai
Contributor

Part of the Ray retirement umbrella (#4453). Implements stage 6 of ray_removal_analysis.md: cluster templates.

Summary

  • Deleted 16 live Ray cluster configs (infra/marin-{big-run,eu-west4,eu-west4-a,eu-west4-vllm,us-central1,us-central1-vllm,us-central2,us-central2-staging,us-central2-vllm,us-east1,us-east1-d-vllm,us-east5,us-east5-a,us-east5-a-vllm,us-east5-b-vllm,us-west4}.yaml).
  • Deleted the two generator templates (marin-cluster-template.yaml, marin-vllm-template.yaml).
  • Stripped the Our Cluster and Maintaining a Ray Cluster sections from infra/README.md; kept the Artifact Registry Cleanup Policy Management section (Iris clusters use the same registry).
  • Net: 19 files changed, 4,049 lines removed.
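
The file counts in the summary can be cross-checked by expanding the brace list and tallying the three groups. The following is a minimal sketch; `expand_braces` is a hypothetical helper written for this illustration and only handles a single brace group like the one above.

```python
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand one {a,b,c} group: 'x-{a,b}.yaml' -> ['x-a.yaml', 'x-b.yaml']."""
    match = re.fullmatch(r"(.*)\{(.*)\}(.*)", pattern)
    if match is None:
        return [pattern]  # no brace group, nothing to expand
    prefix, body, suffix = match.groups()
    return [f"{prefix}{item}{suffix}" for item in body.split(",")]


# The brace list copied from the summary bullet.
LIVE_CONFIGS = expand_braces(
    "infra/marin-{big-run,eu-west4,eu-west4-a,eu-west4-vllm,us-central1,"
    "us-central1-vllm,us-central2,us-central2-staging,us-central2-vllm,"
    "us-east1,us-east1-d-vllm,us-east5,us-east5-a,us-east5-a-vllm,"
    "us-east5-b-vllm,us-west4}.yaml"
)
TEMPLATES = ["marin-cluster-template.yaml", "marin-vllm-template.yaml"]
EDITED = ["infra/README.md"]

total = len(LIVE_CONFIGS) + len(TEMPLATES) + len(EDITED)
print(len(LIVE_CONFIGS), "live configs;", total, "files touched")
```

The 16 deleted configs, 2 deleted templates, and 1 edited README account for the 19 files reported.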

Ordering / dependencies

This PR assumes the per-cluster ray down teardown (stage 7) has already been performed. Per the plan, the cluster retires this week or next — ordering stays code-first, cluster-last.

Known residual references (out of scope)

rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/ still returns hits in:

  • lib/marin/src/marin/cluster/config.py — scheduled for stage 5.
  • lib/fray/src/fray/v1/cluster/ray/config.py — scheduled for stage 3f.
  • .agents/projects/linear_ce_loss.md, .agents/projects/vllm-docker.md — historical logbooks the plan explicitly defers.
  • infra/marin-tmux.sh (not in the verify path, but now dead) — will go with stage 5 / stage 7.

These were identified during verification and match the staging called out in the plan.
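
The residual check could also be scripted so that anything outside the known list stands out. This is a rough sketch, not part of the PR: the function names are hypothetical, the allow-list simply mirrors the in-scope paths listed above, and unlike `rg` it does not honor `.gitignore`.

```python
from pathlib import Path

NEEDLE = "infra/marin-"
SEARCH_DIRS = ("lib", "scripts", "docs", ".agents", ".github")

# Residuals the PR description says are scheduled for other stages.
# (infra/marin-tmux.sh lives outside the searched directories.)
EXPECTED_RESIDUALS = {
    "lib/marin/src/marin/cluster/config.py",
    "lib/fray/src/fray/v1/cluster/ray/config.py",
    ".agents/projects/linear_ce_loss.md",
    ".agents/projects/vllm-docker.md",
}


def find_references(repo_root: Path, needle: str = NEEDLE) -> set[str]:
    """Return repo-relative paths of text files that contain `needle`."""
    hits: set[str] = set()
    for name in SEARCH_DIRS:
        base = repo_root / name
        if not base.is_dir():
            continue
        for path in base.rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if needle in text:
                hits.add(path.relative_to(repo_root).as_posix())
    return hits


def unexpected_references(repo_root: Path) -> set[str]:
    """Hits that are not on the allow-list of known residuals."""
    return find_references(repo_root) - EXPECTED_RESIDUALS
```

Calling `unexpected_references(Path("."))` from the repository root would return an empty set if only the listed residuals remain.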

Test plan

  • ./infra/pre-commit.py infra/README.md passes.
  • rg 'infra/marin-' lib/ scripts/ docs/ .agents/ .github/ returns only the residuals listed above (all scheduled for other stages).
  • infra/README.md retains Artifact Registry section and renders cleanly.

Part of the Ray retirement umbrella (#4453): stage 6 of
ray_removal_analysis.md. Deletes all 16 live cluster configs plus the
two generator templates (marin-cluster-template.yaml,
marin-vllm-template.yaml), and strips the "Our Cluster" and "Maintaining
a Ray Cluster" sections from infra/README.md. The Artifact Registry
cleanup section is preserved because Iris clusters also use that
registry.

The only remaining in-tree references to these YAML paths are in code
scheduled for deletion in stage 5 (lib/marin/src/marin/cluster/config.py)
and stage 3f (lib/fray/src/fray/v1/cluster/ray/config.py), plus
historical logbooks under .agents/projects/ that the plan explicitly
defers. infra/marin-tmux.sh still references these paths and is also
covered by stage 5 / stage 7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yonromai added the agent-generated label (Created by automation/agent) on Apr 23, 2026
Standalone Ray-cluster tmux helper with zero in-repo references. Dead
weight now that the cluster configs and wrapper are gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yonromai marked this pull request as ready for review on April 23, 2026 18:41

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fbeeb1e3f2

@@ -1,253 +0,0 @@
#####################################################
P1: Keep region alias cluster configs available

Removing all infra/marin-*.yaml files makes both fray.v1.cluster.ray.config.find_config_by_region and marin.cluster.config.find_config_by_region fail for every region alias, because those helpers only resolve to files under this naming pattern and otherwise raise FileNotFoundError. Any existing Ray entrypoint using region-based cluster selection (for example Fray specs that pass cluster=<region>) now hard-fails at startup instead of connecting.
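
The failure mode described here can be illustrated with a minimal sketch. This is not the actual `find_config_by_region` from either module: the function name, the `infra_dir` parameter, and the exact `marin-<region>.yaml` lookup are assumptions based only on the description above.

```python
from pathlib import Path


def find_config_by_region_sketch(region: str, infra_dir: Path) -> Path:
    """Hypothetical resolver: map a region alias to infra/marin-<region>.yaml."""
    candidate = infra_dir / f"marin-{region}.yaml"
    if not candidate.is_file():
        # With every marin-*.yaml deleted, every alias ends up on this branch.
        raise FileNotFoundError(
            f"No cluster config for region {region!r}: {candidate}"
        )
    return candidate
```

Under these assumptions, any caller that passes a region alias fails at lookup time once the YAML files are gone, before any connection to a cluster is attempted.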

@@ -1,130 +0,0 @@
# Unique Identifier for the Head Node + Workers
P2: Keep template files until generator APIs are retired

Deleting marin-cluster-template.yaml (and the vLLM template in the same change) breaks marin.cluster.config.update_cluster_configs(): it calls get_template_path() for each entry in CONFIGS, which now raises FileNotFoundError immediately because the expected template no longer exists. This turns config regeneration from a working maintenance path into a runtime error unless the generator API/callers are removed in the same commit.
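
A rough sketch of the regeneration path described here shows why the first missing template aborts the whole run. Everything below is hypothetical: the `_sketch` names, the contents of the config table, and the `{{CLUSTER}}` placeholder are stand-ins for illustration, not the real `marin.cluster.config` code.

```python
from pathlib import Path

# Hypothetical stand-in for CONFIGS: cluster name -> template filename.
CONFIGS_SKETCH = {
    "us-central2": "marin-cluster-template.yaml",
    "us-central2-vllm": "marin-vllm-template.yaml",
}


def get_template_path_sketch(infra_dir: Path, template_name: str) -> Path:
    """Return the template path, or raise if the file is missing."""
    path = infra_dir / template_name
    if not path.is_file():
        raise FileNotFoundError(f"Template not found: {path}")
    return path


def update_cluster_configs_sketch(infra_dir: Path) -> list[Path]:
    """Render one marin-<cluster>.yaml per entry from its template."""
    written = []
    for cluster, template_name in CONFIGS_SKETCH.items():
        # Raises on the first entry whose template has been deleted.
        template = get_template_path_sketch(infra_dir, template_name)
        out = infra_dir / f"marin-{cluster}.yaml"
        out.write_text(template.read_text().replace("{{CLUSTER}}", cluster))
        written.append(out)
    return written
```

In this sketch the loop has no fallback, so removing the templates without removing the generator turns regeneration into an immediate `FileNotFoundError`.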

@yonromai merged commit 03e0b6e into main on Apr 23, 2026
44 checks passed
@yonromai deleted the remove-ray-cluster-yamls branch on April 23, 2026 20:49
yonromai added a commit that referenced this pull request Apr 23, 2026
## Summary

Stage 3f of the Ray removal (umbrella #4453). Deletes the legacy
Ray-backed `fray.v1` execution layer in its entirety. `fray.v1` is
orphaned:

- Earlier code stages (#5131, #5132) removed every external consumer of
`fray.v1.*` across `lib/marin`, `lib/levanter`, `experiments/`,
`tests/`, and docs.
- Stage 7 GCP teardown on 2026-04-23 destroyed the 9 non-`marin-big-run`
Ray head VMs and 34 firewall rules, so no live infrastructure targets
the v1 code path.

## What this PR deletes

- `lib/fray/src/fray/v1/**` — cluster, cluster/ray/*, job/context,
isolated_env, queue, fn_thunk, cli (23 files, ~6.7k LOC)
- `lib/fray/tests/{conftest,test_cluster,test_queue,test_isolated_env,test_job_context,test_device_flops}.py` (v1-only tests)
- `[project.scripts] fray = "fray.v1.cli:main"` entrypoint in
`lib/fray/pyproject.toml`
- 10 v1 entries in `.pyrefly-baseline.json`
- Residual v1 references in `lib/fray/src/fray/__init__.py` docstring,
`lib/fray/src/fray/cluster/__init__.py` docstring, and
`lib/fray/AGENTS.md`

## What stays

- `fray.v2` (production API) — untouched
- `fray.cluster` — still a v2 re-export shim; wide external use via
`from fray.cluster import ResourceConfig`
- `ray==2.54.0` optional dep + `ray[default]` in `fray_test` group —
kept for `fray.v2.ray_backend`, deferred to stage 3g

## Verification

- `./infra/pre-commit.py --all-files --fix` → OK
- `uv run pyrefly check`: 150 errors (pre-commit filters via baseline);
origin/main reports 163; the 13-error drop matches v1 code we removed,
no new unsuppressed errors introduced
- `uv run pytest lib/fray/tests -x --timeout=60` → 60 passed (all
remaining v2 tests)

## Test plan

- [x] pre-commit.py all-files passes
- [x] pyrefly baseline stays clean
- [x] fray v2 test suite passes
- [ ] CI green before merge

## Next steps

After this merges, the remaining roadmap for #4453:

1. **Stage 3g** — drop the Ray backend from `fray.v2` (remove
`fray.v2.ray_backend`, delete `ray==2.54.0` dep + `ray[default]` in
`fray_test` group).
2. **Stage 3i** — rename `fray.v2.*` → `fray.*` once v2 is the only
backend.
3. **GCP §2 + §3** (parked on `marin-big-run` retirement): delete 6
`RAY_*` secrets + 163 `marin_cluster*` artifact-registry digests across
6 regions. See audit log on #4453.
4. **Close #4453** once 3g, 3i, §2, and §3 are done.

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>