Migrate all supported Marin code off Ray #4453

@yonromai

We're deprecating Ray in favor of Iris/Fray v2 (#4269). Execution, training, and Zephyr are already on v2, but a long tail of code still depends on Ray — either directly (import ray) or transitively via Fray v1 (which is Ray-backed). This issue tracks migrating every supported codepath so we can turn off the Ray cluster.

Ray entrypoints — code that needs migrating

These are library modules and workflows that consumers actively depend on. Each must be migrated to Fray v2 / Iris before Ray can be removed.

Direct import ray — needs architectural work (6 files)
| Area | Files | What they use Ray for |
| --- | --- | --- |
| Classification pipeline (3 files) | processing/classification/inference.py, classifier.py, autoscaler.py | @ray.remote actors, ray.util.queue.Queue, RayActorError — full actor-pool pattern for distributed inference |
| Executor (1 file) | execution/executor.py | ray.remote_function.RemoteFunction isinstance check + ray.get() — legacy support for @ray.remote-decorated executor steps |
| Levanter distributed (1 file) | levanter/distributed.py | ray.init(), ray.shutdown() — Levanter's distributed runtime init |
| vLLM server (1 file) | inference/vllm_server.py | from ray._private.accelerators import TPUAcceleratorManager in try/except — optional TPU detection |
Via Fray v1 — needs import migration + current_cluster() → current_client() rework (24 files)
| Area | Files | v1 APIs used |
| --- | --- | --- |
| RL (7 files) | rl/rl_job.py, rl/train_worker.py, rl/rollout_worker.py, rl/curriculum.py, rl/weight_transfer/{arrow_flight,jax}.py, rl/scripts/evaluate_environment.py | current_cluster(), JobRequest, get_default_job_ctx() |
| Evaluation (9 files) | evaluation/evaluators/{evaluator,lm_evaluation_harness_evaluator,harbor_evaluator,simple_evaluator,levanter_tpu_evaluator,evalchemy_evaluator,levanter_lm_eval_evaluator}.py, evaluation/{visualize,run}.py | current_cluster(), JobRequest, build_runtime_env_for_packages |
| Processing | processing/classification/fasttext/train_fasttext.py | current_cluster(), JobRequest |
| Export | export/levanter_checkpoint.py | current_cluster(), JobRequest |
| Inference | inference/vllm_smoke_test.py | current_cluster(), JobRequest |
| Levanter | levanter/callbacks/_metrics.py | fray.v1.cluster.device_flops (trivial — v2 equivalent exists at fray.v2.device_flops) |
| Tests | tests/conftest.py, tests/transform/conftest.py, tests/datakit/download/conftest.py, tests/rl/test_weight_transfer.py, tests/rl/integration/test_iris_integration.py, tests/integration_test.py, lib/zephyr/tests/conftest.py | create_cluster, set_current_cluster, fray_default_job_ctx, ray.init(), @ray.remote in test |

Note: evaluation/run.py imports both fray.v1.cluster.ResourceConfig and fray.cluster.ResourceConfig side by side with aliases — a hybrid state from incremental migration.
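
For the one swap the table calls trivial (device_flops), the rewrite could be as small as the sketch below. It assumes the import is written as a single-name `from fray.v1.cluster import device_flops` line and that `fray.v2.device_flops` refers to the same object; neither is confirmed here, and `rewrite_device_flops_import` is a hypothetical helper name.

```python
import re

# Matches only the exact single-name import; other v1 imports are left alone.
# Assumption: the v1 import is written in this form in levanter/callbacks/_metrics.py.
_V1_DEVICE_FLOPS = re.compile(
    r"^([ \t]*)from[ \t]+fray\.v1\.cluster[ \t]+import[ \t]+device_flops[ \t]*$",
    re.MULTILINE,
)


def rewrite_device_flops_import(source: str) -> str:
    """Point the device_flops import at the v2 location, keeping indentation."""
    return _V1_DEVICE_FLOPS.sub(r"\1from fray.v2 import device_flops", source)
```

The current_cluster() and JobRequest rows are not covered by a textual swap like this, since they need the rework described in the heading.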

Ray orphans — delete, don't migrate (7 files)

These modules have no importers in library code. They're standalone CLI tools or dead code that will be deleted as part of #4269. No migration work needed.

| File | Why it's an orphan |
| --- | --- |
| run/ray_run.py | CLI entry point only. zephyr/cli.py shells out to it for zephyr ray submit, but that's a deprecated codepath. |
| scripts/ray/cluster.py | Standalone CLI for Ray cluster ops. No importers. |
| scripts/ray/dev_tpu.py | Standalone CLI for dev TPU allocation. No importers. Replaced by scripts/iris/dev_tpu.py. |
| levanter/infra/ray_tpu.py | Only imported by launch_on_ray.py (below). |
| levanter/infra/launch_on_ray.py | Only imports ray_tpu.py. No external callers. Dead chain. |
| marin/cluster/ray.py | Only imported by ray_run.py. |
| scripts/debug/inspect_data.py | Debug script with ray.init(), @ray.remote, JobSubmissionClient. No importers. |
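
One way to re-check the "no importers" claims before deleting is a static scan of import statements. The sketch below uses only the standard library; `imported_modules` and `find_importers` are hypothetical helper names, and the dotted module names you pass in depend on how the packages are laid out.

```python
import ast
from pathlib import Path


def imported_modules(path: Path) -> set[str]:
    """Return the dotted module names a file imports (absolute imports only)."""
    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)
            # `from pkg import mod` may pull in a submodule, so record that form too.
            names.update(f"{node.module}.{alias.name}" for alias in node.names)
    return names


def find_importers(root: Path, target: str) -> list[Path]:
    """List files under `root` that import `target` or anything beneath it."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        mods = imported_modules(path)
        if any(m == target or m.startswith(target + ".") for m in mods):
            hits.append(path)
    return hits
```

A scan like this only sees import statements. It would not catch the zephyr/cli.py case above, where the module is reached by shelling out, nor relative or dynamic imports.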

v1 APIs with no obvious v2 equivalent

These need design work, not just mechanical import swaps:

  • get_default_job_ctx() — used pervasively in RL workers for peer discovery and actor creation. v2 uses current_client() but the job-context pattern is structurally different.
  • build_runtime_env_for_packages() — Ray-specific runtime env builder used in evaluators. Concept doesn't apply to Iris.
  • Direct @ray.remote actor pool — classification pipeline uses Ray's actor model directly. Needs redesign around Fray v2 ActorGroup.
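
For readers unfamiliar with the actor-pool shape mentioned in the last bullet, here is a stand-in sketch using only standard-library threads and queues. It is not Ray or Fray code and does not show ActorGroup, whose API is not described in this issue; `run_pool` is a hypothetical name. It only illustrates the moving parts a redesign has to preserve: a shared task queue, long-lived workers, and per-task failure reporting.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Optional

_STOP = object()  # sentinel that tells a worker to exit


def run_pool(
    fn: Callable[[Any], Any], items: Iterable[Any], num_workers: int = 2
) -> list[tuple[int, Any, Optional[Exception]]]:
    """Run `fn` over `items` on a fixed pool; return (index, result, error) in input order."""
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is _STOP:
                return
            idx, payload = item
            try:
                results.put((idx, fn(payload), None))
            except Exception as exc:  # report the failure instead of killing the worker
                results.put((idx, None, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    count = 0
    for idx, payload in enumerate(items):
        tasks.put((idx, payload))
        count += 1
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    collected = [results.get() for _ in range(count)]
    for t in threads:
        t.join()
    return sorted(collected, key=lambda r: r[0])
```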

Sequencing

  1. Delete Ray orphans — the 7 files above with no library consumers
  2. Migrate infra modules — Levanter callbacks (trivial v1 → v2 swap), then Executor and Levanter distributed (need design work). These have broad downstream consumers (trainer, RL, evaluation, 100+ experiment files) so they must move first.
  3. Migrate leaf modules — RL, Evaluation, Classification, Export, Inference, Fasttext. Can proceed in parallel once infra is ready.
  4. Bake period — run all workflows off Ray
  5. Remove fray.v1 and ray dependency from the repo
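
The ordering above can be read as a dependency graph. The sketch below encodes one reading of it with the standard-library graphlib and groups work into waves that can run in parallel. The node names and edges are assumptions drawn from the list, not an agreed plan; in particular, treating orphan deletion as a prerequisite only for the bake period is a guess.

```python
from graphlib import TopologicalSorter

INFRA = {"levanter_callbacks", "executor", "levanter_distributed"}
LEAVES = {"rl", "evaluation", "classification", "export", "inference", "fasttext"}

# Map each task to the tasks it waits on (assumed edges, see lead-in).
DEPS: dict[str, set[str]] = {leaf: set(INFRA) for leaf in LEAVES}
DEPS["bake_period"] = LEAVES | {"delete_orphans"}
DEPS["remove_fray_v1_and_ray"] = {"bake_period"}


def migration_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all its prerequisites done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Under these assumed edges the six leaf migrations land in a single wave, which matches the note that they can proceed in parallel once infra is ready.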

Related issues

Definition of done

  • Zero import ray in lib/marin/src/marin/ (excluding test helpers)
  • Zero from fray.v1 imports in lib/marin/src/marin/
  • Zero from fray.v1 imports in lib/levanter/src/levanter/
  • Test fixtures in tests/ use v2 APIs
  • Ray orphan files deleted (tracked in #4269, "Single way of running jobs — off Ray completely")
  • .agents/docs/fray-migration.md updated or deleted (currently stale)
  • No regressions in eval, RL, classification, or export workflows
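
The "zero imports" criteria could be checked mechanically, for example in CI. The sketch below is one possible standard-library scan; `banned_imports` and `scan` are hypothetical names, and which directories to scan (and which test helpers to exclude) is left to the caller.

```python
import ast
from pathlib import Path

BANNED_ROOTS = ("ray", "fray.v1")


def _is_banned(module: str) -> bool:
    return any(module == root or module.startswith(root + ".") for root in BANNED_ROOTS)


def banned_imports(path: Path) -> list[tuple[int, str]]:
    """Return (line, module) pairs for Ray or fray.v1 imports in one file."""
    tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
    found = []
    for node in ast.walk(tree):  # also visits imports nested in try/except or functions
        if isinstance(node, ast.Import):
            groups = [[alias.name] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Include `from fray import v1` by also testing module.name forms.
            groups = [[node.module] + [f"{node.module}.{a.name}" for a in node.names]]
        else:
            continue
        for group in groups:
            hit = next((m for m in group if _is_banned(m)), None)
            if hit is not None:
                found.append((node.lineno, hit))
    return sorted(found)


def scan(root: Path) -> dict[Path, list[tuple[int, str]]]:
    """Map each offending file under `root` to its banned imports."""
    report = {}
    for path in sorted(root.rglob("*.py")):
        hits = banned_imports(path)
        if hits:
            report[path] = hits
    return report
```

Running `scan(Path("lib/marin/src/marin"))` and `scan(Path("lib/levanter/src/levanter"))` and requiring empty results would cover the import-related bullets. The workflow-regression bullet still needs real runs.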

Labels: agent-generated, epic, infrastructure
