Migrate all supported Marin code off Ray

## Migrate all supported Marin code off Ray

We're deprecating Ray in favor of Iris/Fray v2 (#4269). Execution, training, and Zephyr are already on v2, but a long tail of code still depends on Ray — either directly (`import ray`) or transitively via Fray v1 (which is Ray-backed). This issue tracks migrating every supported codepath so we can turn off the Ray cluster.

### Ray entrypoints — code that needs migrating

These are library modules and workflows that consumers actively depend on. Each must be migrated to Fray v2 / Iris before Ray can be removed.

<details>
<summary>Direct <code>import ray</code> — needs architectural work (6 files)</summary>

| Area | Files | What they use Ray for |
|------|-------|-----------------------|
| **Classification pipeline** (3 files) | `processing/classification/inference.py`, `classifier.py`, `autoscaler.py` | `@ray.remote` actors, `ray.util.queue.Queue`, `RayActorError` — full actor-pool pattern for distributed inference |
| **Executor** (1 file) | `execution/executor.py` | `ray.remote_function.RemoteFunction` isinstance check + `ray.get()` — legacy support for `@ray.remote`-decorated executor steps |
| **Levanter distributed** (1 file) | `levanter/distributed.py` | `ray.init()`, `ray.shutdown()` — Levanter's distributed runtime init |
| **vLLM server** (1 file) | `inference/vllm_server.py` | `from ray._private.accelerators import TPUAcceleratorManager` in try/except — optional TPU detection |

</details>

<details>
<summary>Via Fray v1 — needs import migration + <code>current_cluster()</code> → <code>current_client()</code> rework (27 files)</summary>

| Area | Files | v1 APIs used |
|------|-------|-------------|
| **RL** (7 files) | `rl/rl_job.py`, `rl/train_worker.py`, `rl/rollout_worker.py`, `rl/curriculum.py`, `rl/weight_transfer/{arrow_flight,jax}.py`, `rl/scripts/evaluate_environment.py` | `current_cluster()`, `JobRequest`, `get_default_job_ctx()` |
| **Evaluation** (9 files) | `evaluation/evaluators/{evaluator,lm_evaluation_harness_evaluator,harbor_evaluator,simple_evaluator,levanter_tpu_evaluator,evalchemy_evaluator,levanter_lm_eval_evaluator}.py`, `evaluation/{visualize,run}.py` | `current_cluster()`, `JobRequest`, `build_runtime_env_for_packages` |
| **Processing** | `processing/classification/fasttext/train_fasttext.py` | `current_cluster()`, `JobRequest` |
| **Export** | `export/levanter_checkpoint.py` | `current_cluster()`, `JobRequest` |
| **Inference** | `inference/vllm_smoke_test.py` | `current_cluster()`, `JobRequest` |
| **Levanter** | `levanter/callbacks/_metrics.py` | `fray.v1.cluster.device_flops` (trivial — v2 equivalent exists at `fray.v2.device_flops`) |
| **Tests** | `tests/conftest.py`, `tests/transform/conftest.py`, `tests/datakit/download/conftest.py`, `tests/rl/test_weight_transfer.py`, `tests/rl/integration/test_iris_integration.py`, `tests/integration_test.py`, `lib/zephyr/tests/conftest.py` | `create_cluster`, `set_current_cluster`, `fray_default_job_ctx`, `ray.init()`, `@ray.remote` in test |

Note: `evaluation/run.py` imports both `fray.v1.cluster.ResourceConfig` and `fray.cluster.ResourceConfig` side by side with aliases — a hybrid state from incremental migration.

</details>
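
The hybrid state noted for `evaluation/run.py` can be detected mechanically. The following is a minimal sketch, assuming it is enough to inspect absolute imports with the standard-library `ast` module. The helper names `fray_import_generations` and `is_hybrid`, and the alias in the usage example, are hypothetical and not part of Fray or Marin.

```python
import ast


def fray_import_generations(source: str) -> set[str]:
    """Return which Fray generations a module imports: "v1", "non-v1", or both."""
    generations: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Join module and imported name so `from fray import v1` counts as v1.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # not an import, or a relative import we cannot resolve here
        for name in names:
            if name == "fray.v1" or name.startswith("fray.v1."):
                generations.add("v1")
            elif name == "fray" or name.startswith("fray."):
                generations.add("non-v1")
    return generations


def is_hybrid(source: str) -> bool:
    """True when a module mixes `fray.v1` imports with any other `fray` import."""
    return fray_import_generations(source) == {"v1", "non-v1"}
```

For example, a module that imports `ResourceConfig` from both `fray.v1.cluster` and `fray.cluster` is reported as hybrid, while a module that only imports from `fray.v1` is not.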

<details>
<summary>Ray orphans — delete, don't migrate (7 files)</summary>

These modules have **no importers in library code**. They're standalone CLI tools or dead code that will be deleted as part of #4269. No migration work needed.

| File | Why it's an orphan |
|------|----------------|
| `run/ray_run.py` | CLI entry point only. `zephyr/cli.py` shells out to it for `zephyr ray submit`, but that's a deprecated codepath. |
| `scripts/ray/cluster.py` | Standalone CLI for Ray cluster ops. No importers. |
| `scripts/ray/dev_tpu.py` | Standalone CLI for dev TPU allocation. No importers. Replaced by `scripts/iris/dev_tpu.py`. |
| `levanter/infra/ray_tpu.py` | Only imported by `launch_on_ray.py` (below). |
| `levanter/infra/launch_on_ray.py` | Only imports `ray_tpu.py`. No external callers. Dead chain. |
| `marin/cluster/ray.py` | Only imported by `ray_run.py`. |
| `scripts/debug/inspect_data.py` | Debug script with `ray.init()`, `@ray.remote`, `JobSubmissionClient`. No importers. |

</details>
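
The "no importers" claim can be re-checked before each deletion. Below is a rough sketch that builds a reverse lookup from parsed imports. It assumes dotted module names such as `marin.cluster.ray` (inferred from the file paths above, not confirmed), and the function names are hypothetical. It only sees static absolute imports, so shell-outs like the `zephyr/cli.py` call into `ray_run.py`, relative imports, and dynamic imports will not show up.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Collect dotted names referenced by absolute import statements."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            modules.add(node.module)
            # `from pkg import mod` may name a submodule, so record that form too.
            modules.update(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def importers_of(target: str, sources: dict[str, str]) -> list[str]:
    """Return the paths in `sources` that import `target` or one of its submodules."""
    hits = []
    for path, text in sources.items():
        names = imported_modules(text)
        if any(n == target or n.startswith(target + ".") for n in names):
            hits.append(path)
    return sorted(hits)
```

Here `sources` maps a file path to its text, so the same function works on an in-memory sample or on files read from the repo. An empty result supports treating the module as an orphan, but a text search for the filename is still needed to catch subprocess invocations.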

### v1 APIs with no obvious v2 equivalent

These need design work, not just mechanical import swaps:

- **`get_default_job_ctx()`** — used pervasively in RL workers for peer discovery and actor creation. v2 uses `current_client()` but the job-context pattern is structurally different.
- **`build_runtime_env_for_packages()`** — Ray-specific runtime env builder used in evaluators. Concept doesn't apply to Iris.
- **Direct `@ray.remote` actor pool** — classification pipeline uses Ray's actor model directly. Needs redesign around Fray v2 `ActorGroup`.

### Sequencing

1. **Delete Ray orphans** — the 7 files above with no library consumers
2. **Migrate infra modules** — Levanter callbacks (trivial `v1`→`v2` swap), then Executor and Levanter distributed (need design work). These have broad downstream consumers (trainer, RL, evaluation, 100+ experiment files) so they must move first.
3. **Migrate leaf modules** — RL, Evaluation, Classification, Export, Inference, Fasttext. Can proceed in parallel once infra is ready.
4. **Bake period** — run all workflows off Ray
5. **Remove `fray.v1` and `ray` dependency from the repo**
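
One way to see which steps can run in parallel is to encode the ordering above as a dependency graph. This is only an illustrative sketch: the step names are shorthand invented here, and treating every numbered step as a hard prerequisite of the next is an assumption rather than something the plan states.

```python
from graphlib import TopologicalSorter

# Shorthand labels for the steps above (hypothetical names).
INFRA = {"executor", "levanter_distributed"}
LEAVES = {"rl", "evaluation", "classification", "export", "inference", "fasttext"}

# Map each step to the steps it waits on.
DEPS = {
    "delete_orphans": set(),
    "levanter_callbacks": {"delete_orphans"},
    "executor": {"levanter_callbacks"},
    "levanter_distributed": {"levanter_callbacks"},
    **{leaf: set(INFRA) for leaf in LEAVES},
    "bake_period": set(LEAVES),
    "remove_ray_and_fray_v1": {"bake_period"},
}


def waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; everything in one wave can proceed in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


if __name__ == "__main__":
    for index, wave in enumerate(waves(DEPS), start=1):
        print(index, ", ".join(wave))
```

Under these assumptions the six leaf migrations land in a single wave after both infra modules, which matches the note that they can proceed in parallel once infra is ready.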

### Related issues

- #4269 — Off Ray completely (umbrella)
- #3959 / #3960 — RL pipeline migration to Fray v2/Iris
- #4088 / #4090 — Classification inference migration
- #4398 — Logprob evals migration (merged)

### Definition of done

- [ ] Zero `import ray` in `lib/marin/src/marin/` (excluding test helpers)
- [ ] Zero `from fray.v1` imports in `lib/marin/src/marin/`
- [ ] Zero `from fray.v1` imports in `lib/levanter/src/levanter/`
- [ ] Test fixtures in `tests/` use v2 APIs
- [ ] Ray orphan files deleted (tracked in #4269)
- [ ] `.agents/docs/fray-migration.md` updated or deleted (currently stale)
- [ ] No regressions in eval, RL, classification, or export workflows
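
The import-related checkboxes could be verified with a small script rather than by eye. The sketch below uses only the standard library and flags absolute imports of `ray` and `fray.v1`. The function names are hypothetical, and it does not implement the "excluding test helpers" carve-out, which would need an explicit path filter.

```python
import ast
from pathlib import Path

BANNED_PREFIXES = ("ray", "fray.v1")


def _is_banned(name: str, banned: tuple[str, ...]) -> bool:
    # Match the package itself or any submodule, but not lookalikes such as "rayon".
    return any(name == b or name.startswith(b + ".") for b in banned)


def banned_imports(source: str, banned=BANNED_PREFIXES) -> list[tuple[int, str]]:
    """Return (line number, dotted name) for each absolute import of a banned package."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            if _is_banned(node.module, banned):
                names = [node.module]
            else:
                # Covers forms like `from fray import v1`.
                names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        hits.extend((node.lineno, n) for n in names if _is_banned(n, banned))
    return sorted(hits)


def scan_tree(root: str, banned=BANNED_PREFIXES) -> dict[str, list[tuple[int, str]]]:
    """Scan every .py file under `root` and report the files that still have hits."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = banned_imports(path.read_text(encoding="utf-8"), banned)
        if hits:
            report[str(path)] = hits
    return report
```

Running `scan_tree("lib/marin/src/marin")` and `scan_tree("lib/levanter/src/levanter")` from the repo root should return empty dicts once the corresponding boxes can be ticked. Because `ast.walk` visits nested nodes, imports guarded by `try`/`except`, like the TPU detection in `inference/vllm_server.py`, are reported as well.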

Migrate all supported Marin code off Ray #4453