Migrate all supported Marin code off Ray
We're deprecating Ray in favor of Iris/Fray v2 (#4269). Execution, training, and Zephyr are already on v2, but a long tail of code still depends on Ray — either directly (import ray) or transitively via Fray v1 (which is Ray-backed). This issue tracks migrating every supported codepath so we can turn off the Ray cluster.
Ray entrypoints — code that needs migrating
These are library modules and workflows that consumers actively depend on. Each must be migrated to Fray v2 / Iris before Ray can be removed.
Direct import ray — needs architectural work (6 files)
| Area | Files | What they use Ray for |
| --- | --- | --- |
| Classification pipeline (3 files) | processing/classification/inference.py, classifier.py, autoscaler.py | @ray.remote actors, ray.util.queue.Queue, RayActorError: full actor-pool pattern for distributed inference |
| Executor (1 file) | execution/executor.py | ray.remote_function.RemoteFunction isinstance check + ray.get(): legacy support for @ray.remote-decorated executor steps |
| Levanter distributed (1 file) | levanter/distributed.py | ray.init(), ray.shutdown(): Levanter's distributed runtime init |
| vLLM server (1 file) | inference/vllm_server.py | from ray._private.accelerators import TPUAcceleratorManager in try/except: optional TPU detection |
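For the executor row, one possible transitional shape is to keep the legacy detection but drop the module-level import of Ray. The sketch below is an illustration only: the helper name `is_ray_remote_function` is hypothetical and is not taken from execution/executor.py. It relies on the fact that an object can only be an instance of a Ray class if Ray has already been imported somewhere in the process.

```python
import sys


def is_ray_remote_function(obj: object) -> bool:
    """Return True if obj is a Ray RemoteFunction, without importing Ray.

    Hypothetical helper (an assumption for illustration, not existing Marin code).
    If "ray" is absent from sys.modules, nothing in this process can be an
    instance of ray.remote_function.RemoteFunction, so the answer is False.
    """
    ray = sys.modules.get("ray")
    if ray is None:
        return False
    # Look the class up defensively so a missing attribute is not an error.
    remote_function = getattr(ray, "remote_function", None)
    remote_cls = getattr(remote_function, "RemoteFunction", None)
    return remote_cls is not None and isinstance(obj, remote_cls)


def plain_step() -> int:
    return 1


if __name__ == "__main__":
    # A plain Python function is never reported as a Ray remote function.
    print(is_ray_remote_function(plain_step))
```

With this shape the executor module itself no longer imports Ray, and the check simply returns False once no step is decorated with @ray.remote, at which point the helper can be deleted.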
Via Fray v1 — needs import migration + current_cluster() → current_client() rework (24 files)
| Area | Files | v1 APIs used |
| --- | --- | --- |
| RL (7 files) | rl/rl_job.py, rl/train_worker.py, rl/rollout_worker.py, rl/curriculum.py, rl/weight_transfer/{arrow_flight,jax}.py, rl/scripts/evaluate_environment.py | current_cluster(), JobRequest, get_default_job_ctx() |
| Evaluation (9 files) | evaluation/evaluators/{evaluator,lm_evaluation_harness_evaluator,harbor_evaluator,simple_evaluator,levanter_tpu_evaluator,evalchemy_evaluator,levanter_lm_eval_evaluator}.py, evaluation/{visualize,run}.py | current_cluster(), JobRequest, build_runtime_env_for_packages |
| Processing | processing/classification/fasttext/train_fasttext.py | current_cluster(), JobRequest |
| Export | export/levanter_checkpoint.py | current_cluster(), JobRequest |
| Inference | inference/vllm_smoke_test.py | current_cluster(), JobRequest |
| Levanter | levanter/callbacks/_metrics.py | fray.v1.cluster.device_flops (trivial; v2 equivalent exists at fray.v2.device_flops) |
| Tests | tests/conftest.py, tests/transform/conftest.py, tests/datakit/download/conftest.py, tests/rl/test_weight_transfer.py, tests/rl/integration/test_iris_integration.py, tests/integration_test.py, lib/zephyr/tests/conftest.py | create_cluster, set_current_cluster, fray_default_job_ctx, ray.init(), @ray.remote in test |
Note: evaluation/run.py imports both fray.v1.cluster.ResourceConfig and fray.cluster.ResourceConfig side by side with aliases — a hybrid state from incremental migration.
Ray orphans — delete, don't migrate (7 files)
These modules have no importers in library code. They're standalone CLI tools or dead code that will be deleted as part of #4269. No migration work needed.
| File | Why it's orphan |
| --- | --- |
| run/ray_run.py | CLI entry point only. zephyr/cli.py shells out to it for zephyr ray submit, but that's a deprecated codepath. |
| scripts/ray/cluster.py | Standalone CLI for Ray cluster ops. No importers. |
| scripts/ray/dev_tpu.py | Standalone CLI for dev TPU allocation. No importers. Replaced by scripts/iris/dev_tpu.py. |
| levanter/infra/ray_tpu.py | Only imported by launch_on_ray.py (below). |
| levanter/infra/launch_on_ray.py | Only imports ray_tpu.py. No external callers. Dead chain. |
| marin/cluster/ray.py | Only imported by ray_run.py. |
| scripts/debug/inspect_data.py | Debug script with ray.init(), @ray.remote, JobSubmissionClient. No importers. |
v1 APIs with no obvious v2 equivalent
These need design work, not just mechanical import swaps:
- get_default_job_ctx(): used pervasively in RL workers for peer discovery and actor creation. v2 uses current_client(), but the job-context pattern is structurally different.
- build_runtime_env_for_packages(): Ray-specific runtime env builder used in evaluators. The concept doesn't apply to Iris.
- Direct @ray.remote actor pool: the classification pipeline uses Ray's actor model directly. Needs redesign around Fray v2 ActorGroup.
Sequencing
- Delete Ray orphans: the 7 files above with no library consumers.
- Migrate infra modules: Levanter callbacks (trivial v1→v2 swap), then Executor and Levanter distributed (need design work). These have broad downstream consumers (trainer, RL, evaluation, 100+ experiment files), so they must move first.
- Migrate leaf modules: RL, Evaluation, Classification, Export, Inference, Fasttext. These can proceed in parallel once infra is ready.
- Bake period: run all workflows off Ray.
- Remove fray.v1 and the ray dependency from the repo.
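The ordering above can be written down as a small dependency graph to make the parallelizable steps explicit. This is only a sketch: the node names are informal labels invented here for the steps in this list, and the edges encode the stated ordering rather than any real import graph.

```python
from graphlib import TopologicalSorter

# Informal labels for the steps above (assumed names, not real module paths).
INFRA = {"executor", "levanter_distributed"}
LEAF = ["rl", "evaluation", "classification", "export", "inference", "fasttext"]

# Map each step to the set of steps that must finish before it can start.
DEPS: dict[str, set[str]] = {
    "delete_orphans": set(),
    "levanter_callbacks": {"delete_orphans"},
    "executor": {"levanter_callbacks"},
    "levanter_distributed": {"levanter_callbacks"},
    **{leaf: set(INFRA) for leaf in LEAF},
    "bake_period": set(LEAF),
    "remove_fray_v1_and_ray": {"bake_period"},
}


def migration_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; steps within one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(migration_waves(DEPS), start=1):
        print(number, ", ".join(wave))
```

All six leaf areas land in the same wave, which matches the note that they can proceed in parallel once the infra modules are done.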
Related issues
Definition of done
- No import ray in lib/marin/src/marin/ (excluding test helpers)
- No from fray.v1 imports in lib/marin/src/marin/
- No from fray.v1 imports in lib/levanter/src/levanter/
- tests/ use v2 APIs
- .agents/docs/fray-migration.md updated or deleted (currently stale)
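The import-related criteria could be checked mechanically. Below is a minimal sketch using only the standard library; the function name `find_banned_imports` and the `BANNED_PREFIXES` tuple are assumptions made for illustration, not existing repo tooling. It parses each Python file and reports absolute imports of ray or fray.v1. It does not handle the "excluding test helpers" carve-out, relative imports, or code that shells out to Ray CLIs.

```python
import ast
from pathlib import Path

# Module prefixes that must no longer be imported (assumed from the list above).
BANNED_PREFIXES = ("ray", "fray.v1")


def _is_banned(module: str) -> bool:
    """True for 'ray', 'ray.x', 'fray.v1', 'fray.v1.x'; False for e.g. 'raylib'."""
    return any(module == p or module.startswith(p + ".") for p in BANNED_PREFIXES)


def find_banned_imports(root: Path) -> list[tuple[Path, int, str]]:
    """Return (file, line, module) for each banned import statement under root."""
    findings = []
    for path in sorted(root.rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed as Python source
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # Cover both `from fray.v1 import x` and `from fray import v1`.
                modules = [node.module] + [
                    f"{node.module}.{alias.name}" for alias in node.names
                ]
            else:
                continue
            for module in modules:
                if _is_banned(module):
                    findings.append((path, node.lineno, module))
                    break  # one finding per import statement
    return findings


if __name__ == "__main__":
    for file, line, module in find_banned_imports(Path("lib/marin/src/marin")):
        print(f"{file}:{line}: imports {module}")
```

An empty result for lib/marin/src/marin/ and lib/levanter/src/levanter/ would be consistent with the first three criteria, though it does not by itself prove that no Ray-backed codepath remains.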