Commit 7a56e01

No more Ray in Marin (#5138)
## Summary

Stage 3g of the Ray removal plan (#4453). Deletes `fray.v2.ray_backend` (~3.1k LOC) and the Ray auto-detect branch in `fray.v2.client.current_client`. `fray.v2` is now Iris-only, with `LocalClient` as the fallback for tests/dev. Builds on stage 3f (#5137) which deleted `fray.v1`.

## What's deleted

- `lib/fray/src/fray/v2/ray_backend/` — all 10 modules (`backend.py`, `tpu.py`, `dashboard.py`, `dashboard_proxy.py`, `deps.py`, `fn_thunk.py`, `resources.py`, `context.py`, `auth.py`, `__init__.py`).
- Ray auto-detect branch in `fray.v2.client.current_client` (the `ray.is_initialized()` / `FRAY_CLUSTER_SPEC=ray` path and the `RayClient` import). Resolution order is now: explicit client → Iris auto-detect → `LocalClient` fallback.
- Ray-flavored cases in `lib/fray/tests/test_v2_current_client.py` (`test_ray_auto_detection`, `test_ray_not_detected_when_not_initialized`, `test_iris_takes_priority_over_ray`) and the residual `patch("ray.is_initialized", ...)` calls.
- `ray = ["ray==2.54.0"]` optional dep and the `ray[default]` / `pip # ray requires pip` lines from the `fray_test` group in `lib/fray/pyproject.toml`.
- `marin-fray[ray]` in `lib/zephyr/pyproject.toml` becomes plain `marin-fray` (zephyr has no direct `import ray`, the extra is vestigial).
- `ray==2.54.0` from `lib/marin/pyproject.toml` and `ray[default]==2.54.0` from `lib/levanter/pyproject.toml`. Both were dead direct deps (`rg '^import ray$|^from ray' lib/marin lib/levanter experiments` returns empty). Stale "avoid 7+ due to ray" comment on levanter's `protobuf>=6,<7` pin trimmed to just the TB/XProf reason.
- Stale ``(Ray)``/``Ray's ...`` mentions and TODOs in `actor.py`, `types.py`, `local_backend.py`, and `lib/fray/AGENTS.md`.
- `ray` rows in `uv.lock` (fray package + all three dependency groups + zephyr's transitive entry + marin/levanter direct edges).

## Post-audit cleanup

Follow-up commit after a post-#5138 grep audit flagged five more stale Ray references:

- `lib/levanter/docker/tpu/Dockerfile.cluster`: dropped the `ray[default,gcp]==2.34.0` install + `dlwh/ray` fork patch (HACK for ray-project/ray#47769), `RAY_USAGE_STATS_ENABLED`, and the stale "using Ray to manage TPU slices" header comment. File is still referenced by `.github/workflows/docker-images.yaml` and `lib/levanter/infra/cluster/push_cluster_docker.sh`, so kept.
- `lib/levanter/docker/tpu/Dockerfile.incremental`: dropped `RAY_USAGE_STATS_ENABLED`.
- `infra/README.md`: replaced the "## Ray" section (claiming Ray is Marin's cluster infra) with a terse pointer to Iris + fray + zephyr.
- `docs/dev-guide/contributing.md`: dropped the obsolete "unset RAY_ADDRESS" guardrail for running unit tests.
- `tests/test_dry_run.py`: dropped `os.environ["RAY_LOCAL_CLUSTER"] = "1"` (confirmed no remaining readers repo-wide post-#5138) and the now-unused `os` import.

## What's left

- `fray.v2` Iris-only: `FrayIrisClient` / `IrisActorHandle` / `IrisActorGroup` unchanged.
- `fray.cluster/__init__.py` (v2 re-export shim) untouched — has ~60 external call sites and its API is load-bearing.
- No changes to `fray.v2` subpackage structure: rename to root is stage 3i.
- `ray` survives in `uv.lock` as a transitive dep of `vllm-tpu` under marin's `vllm` extra. That's intentional: vllm-tpu pins ray itself, we no longer pin it on our side.

## Verification

- [x] `./infra/pre-commit.py --all-files --fix` — OK.
- [x] `uvx pyrefly@0.61.0 check --baseline .pyrefly-baseline.json` — 0 errors; baseline untouched (no `ray_backend` entries existed).
- [x] `uv run pytest lib/fray/tests -x --timeout=60` — 57 passed.
- [x] `uv lock` — clean re-resolve; `ray` removed from fray extras, from the zephyr transitive edge, and from the marin/levanter direct edges. Remains only as a vllm-tpu transitive.
- [x] Repo-wide grep `ray_backend|fray\.v2\.ray|RayClient|FRAY_CLUSTER_SPEC` returns only archived `.agents/projects/*` design docs.

## Next steps

- **Stage 3i**: rename `fray.v2.*` → `fray.*` (drop the `v2` subpackage). Tracking issue pending; unblocked by this PR.
- **GCP §2 (marin_cluster* artifact-registry digests)** and **§3 (RAY_* secrets)** remain parked on `marin-big-run` Ray cluster retirement.
- Once §2/§3 land, we can close the parent ticket **#4453**.

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
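
The dead-dependency check quoted in the message (`rg '^import ray$|^from ray' ...`) can be approximated in plain Python. This is a minimal sketch, not the tooling the PR used: the function name `find_direct_ray_imports` is hypothetical, and unlike `rg` it only scans `*.py` files.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Same pattern as the rg invocation in the message: a bare `import ray`
# line, or any line that starts with `from ray`.
DIRECT_RAY_IMPORT = re.compile(r"^import ray$|^from ray")


def find_direct_ray_imports(roots: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every top-level Ray import found.

    Hypothetical helper: restricted to *.py files, which rg is not.
    """
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*.py")):
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                # Each line is matched on its own, so ^ and $ anchor to the line.
                if DIRECT_RAY_IMPORT.search(line):
                    hits.append((str(path), lineno, line))
    return hits
```

An empty result for `find_direct_ray_imports(["lib/marin", "lib/levanter", "experiments"])` would correspond to the "returns empty" claim above. Indented imports (for example inside a `try:` block) are deliberately not matched, mirroring the anchored pattern.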
1 parent 5c22792 commit 7a56e01

26 files changed

Lines changed: 54 additions & 3184 deletions

docs/dev-guide/contributing.md

Lines changed: 0 additions & 2 deletions
@@ -34,8 +34,6 @@ uv run pytest -m 'not slow' <relevant test paths>
 
 Use `make test` when you need the full default test suite.
 
-*Note* that to run the unit tests, you must not have set `RAY_ADDRESS`. You can unset it with `unset RAY_ADDRESS` or `export RAY_ADDRESS=""`.
-
 ### Opening a pull request
 
 Before opening a pull request:

infra/README.md

Lines changed: 4 additions & 12 deletions
@@ -12,19 +12,11 @@ We have several clusters for Marin, each with a different TPU type:
 
 
 
-## Ray
+## Cluster Infrastructure
 
-[Ray](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html) provides the underlying
-cluster infrastructure for Marin. We use Ray for:
-- **Cluster management**: Autoscaling, node provisioning, job scheduling
-- **Training**: Distributed model training via Levanter
-- **Inference**: GPU/TPU actor pools for model serving
-
-For **data processing** (downloads, transforms, deduplication), we use Zephyr instead of raw Ray.
-
-**Useful Documentation**:
-- [Ray Cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html): Cluster architecture and key concepts
-- [Ray on GCP](https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/gcp.html): GCP-specific deployment
+Marin clusters run on [Iris](../lib/iris/README.md) for orchestration (job/task
+scheduling, node provisioning), fray for distributed execution (Iris-backed),
+and [zephyr](../lib/zephyr/README.md) for data pipelines.
 
 ## Preemptibility
 

lib/fray/AGENTS.md

Lines changed: 1 addition & 2 deletions
@@ -12,7 +12,6 @@ Distributed execution abstraction layer. Start with the shared instructions in `
 - `src/fray/v2/client.py` — `Client` protocol, `current_client()`, auto-detection
 - `src/fray/v2/types.py` — `JobRequest`, `ResourceConfig`, `DeviceConfig` (CPU/GPU/TPU)
 - `src/fray/v2/actor.py` — `ActorHandle`, `ActorGroup`, actor hosting
-- `src/fray/v2/ray_backend/` — Ray backend (`submit`, `host_actor`)
 - `src/fray/v2/iris_backend.py` — Iris backend
 - `src/fray/v2/local_backend.py` — Local/thread backend (testing)
 - `src/fray/v2/device_flops.py` — TPU/GPU flops calculation
@@ -23,4 +22,4 @@ Distributed execution abstraction layer. Start with the shared instructions in `
 - **v2 is the production API.** All new code should use `fray.v2`.
 - Always use the `Client` protocol, not concrete backend implementations.
 - Actor resources: set `num_cpus=0` on actors to avoid head-node resource contention.
-- Testing: use `LocalClient` for unit tests. Only use Ray/Iris backends for integration tests.
+- Testing: use `LocalClient` for unit tests. Only use the Iris backend for integration tests.

lib/fray/pyproject.toml

Lines changed: 0 additions & 5 deletions
@@ -24,16 +24,11 @@ dependencies = [
     "zstandard>=0.22.0",
 ]
 
-[project.optional-dependencies]
-ray = ["ray==2.54.0"]
-
 [dependency-groups]
 fray_test = [
     "numpy",
-    "pip", # ray requires pip to be installed
     "pytest-timeout",
     "pytest>=8.3.2",
-    "ray[default]"
 ]
 fray_tpu_test = ["jax[tpu]", { include-group = "fray_test" }]
 

lib/fray/src/fray/v2/actor.py

Lines changed: 3 additions & 4 deletions
@@ -26,9 +26,8 @@ def __getattr__(self, method_name: str) -> ActorMethod: ...
 class ActorContext:
     """Context available to actors during execution.
 
-    ``shutdown_event`` is set by the actor when it is ready to exit.
-    Backends create the event and block on it (Iris) or use it to trigger
-    ``exit_actor()`` (Ray).
+    ``shutdown_event`` is set by the actor when it is ready to exit; the
+    backend creates the event and blocks on it to tear the actor down.
     """
 
     handle: ActorHandle
@@ -108,7 +107,7 @@ def submit(self, *args: Any, **kwargs: Any) -> ActorFuture:
         mechanism (e.g. StartOperation + GetOperation RPCs) so that
         transient connection drops don't kill the call.
 
-        For local and Ray backends this is identical to remote().
+        For the local backend this is identical to remote().
         """
         ...
 
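
The `shutdown_event` contract in the updated `ActorContext` docstring (the actor sets the event, the backend creates it and blocks on it) can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions: `run_actor` and `host_actor` are hypothetical names, and a plain `threading.Event` stands in for whatever event type fray actually uses.

```python
import threading
from typing import List


def run_actor(shutdown_event: threading.Event, log: List[str]) -> None:
    """Hypothetical actor body: do some work, then signal readiness to exit."""
    log.append("actor: working")
    log.append("actor: done")
    # The actor, not the backend, decides when it is ready to exit.
    shutdown_event.set()


def host_actor(timeout: float = 5.0) -> List[str]:
    """Hypothetical backend side: create the event, start the actor, block on it."""
    log: List[str] = []
    shutdown_event = threading.Event()
    worker = threading.Thread(target=run_actor, args=(shutdown_event, log))
    worker.start()
    # wait() returns True once the event is set, False if the timeout expires.
    if shutdown_event.wait(timeout):
        log.append("backend: tearing down")
    worker.join()
    return log
```

Because the backend only appends its teardown entry after `wait()` returns, the log order is deterministic: both actor entries come first.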

lib/fray/src/fray/v2/client.py

Lines changed: 1 addition & 19 deletions
@@ -163,18 +163,14 @@ def current_client() -> Client:
     Resolution order:
     1. Explicitly set client (via set_current_client)
     2. Auto-detect Iris environment (get_iris_ctx() returns context)
-    3. Auto-detect Ray environment (ray.is_initialized())
-    4. LocalClient() default
+    3. LocalClient() default
     """
 
     client = _current_client_var.get()
     if client is not None:
         logger.info("current_client: using explicitly set client")
         return client
 
-    import os
-
-    # Auto-detect Iris environment (takes priority over Ray)
     try:
         from iris.client.client import get_iris_ctx
 
@@ -187,20 +183,6 @@ def current_client() -> Client:
     except ImportError:
         logger.warning("current_client: iris not installed")
 
-    # Auto-detect Ray environment
-    try:
-        import ray
-
-        logger.info("current_client: ray.is_initialized()=%s", ray.is_initialized())
-        # surprisingly, Ray doesn't initialize the worker context by default, so check for the env var for v1 compat
-        if ray.is_initialized() or os.environ.get("FRAY_CLUSTER_SPEC", "").startswith("ray"):
-            from fray.v2.ray_backend.backend import RayClient
-
-            logger.info("current_client: using Ray backend (auto-detected)")
-            return RayClient()
-    except ImportError:
-        logger.warning("current_client: ray not installed")
-
     from fray.v2.local_backend import LocalClient
 
     logger.info("current_client: using LocalClient (fallback)")

lib/fray/src/fray/v2/local_backend.py

Lines changed: 2 additions & 2 deletions
@@ -225,8 +225,8 @@ class LocalActorHandle:
     This allows the handle to be created before the actor instance exists,
     enabling actors to access their own handle during __init__.
 
-    Actors are responsible for their own thread safety. This matches Iris/Ray
-    behavior where actor methods can be called concurrently and the actor
+    Actors are responsible for their own thread safety. This matches the Iris
+    backend where actor methods can be called concurrently and the actor
     implementation must handle synchronization internally.
     """
 
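
The `LocalActorHandle` docstring says actors must handle their own synchronization because methods can be called concurrently. A minimal sketch of what that means for an actor author, using only the standard library: `CounterActor` and `hammer` are hypothetical names, and a thread pool stands in for concurrent callers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CounterActor:
    """Hypothetical actor whose methods may be invoked concurrently."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        # The read-modify-write is guarded by the actor itself,
        # since the backend does not serialize calls.
        with self._lock:
            self._count += 1

    def value(self) -> int:
        with self._lock:
            return self._count


def hammer(actor: CounterActor, calls: int = 1000, workers: int = 8) -> int:
    """Call increment() from many threads and return the final count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(calls):
            pool.submit(actor.increment)
    # Leaving the with-block waits for every submitted call to finish.
    return actor.value()
```

With the lock in place, `hammer(CounterActor())` counts every call; without it, the unguarded `+=` would be a data race that may lose updates.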

lib/fray/src/fray/v2/ray_backend/__init__.py

Lines changed: 0 additions & 4 deletions
This file was deleted.

lib/fray/src/fray/v2/ray_backend/auth.py

Lines changed: 0 additions & 48 deletions
This file was deleted.
