Docs sweep: replace ray_run.py / launch_on_ray / ray_tpu references after deletion (#5028, #5031) #5029

@yonromai

🤖 Follow-up to the Ray deletion PRs (#5028, #5031). Non-code files still reference modules that have been deleted. A mechanical sed pass would produce misleading docs because scripts/iris/dev_tpu.py is not a drop-in replacement for ray_run.py: it's a persistent-session tool (allocate → execute → release), not a one-shot submitter. The flags diverge as well: --no_wait, --extra, --cluster, --entrypoint-num-*, --auto-stop, and --submission-id have no equivalent, and --env_vars KEY VALUE becomes -e KEY=VALUE, which is accepted only on subcommands. The Levanter launch_on_ray references are similarly workflow-specific.
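To make the flag divergence concrete, here is a minimal sketch of a helper that audits an old ray_run.py command line from a doc. It only encodes the two facts stated above (the list of flags with no equivalent, and the --env_vars KEY VALUE to -e KEY=VALUE rewrite). The names `audit_ray_run_command` and `NO_EQUIVALENT_FLAGS` are hypothetical, and the sketch deliberately does not try to emit a full dev_tpu.py invocation, since that depends on the workflow.

```python
import shlex

# Flags the issue lists as having no dev_tpu.py equivalent.
# "--entrypoint-num-" is a prefix standing in for --entrypoint-num-*.
NO_EQUIVALENT_FLAGS = {"--no_wait", "--extra", "--cluster", "--auto-stop", "--submission-id"}
NO_EQUIVALENT_PREFIX = "--entrypoint-num-"


def audit_ray_run_command(command: str):
    """Return (env_flags, unsupported) for an old ray_run.py command line.

    env_flags   -- the `-e KEY=VALUE` form of every `--env_vars KEY VALUE` pair
    unsupported -- flags that need a human decision because nothing replaces them
    """
    tokens = shlex.split(command)
    env_flags, unsupported = [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--":
            # Everything after "--" is the entrypoint command, not a launcher flag.
            break
        if tok == "--env_vars" and i + 2 < len(tokens):
            key, value = tokens[i + 1], tokens[i + 2]
            env_flags += ["-e", f"{key}={value}"]
            i += 3
            continue
        name = tok.split("=", 1)[0]  # also handles --flag=value spellings
        if name in NO_EQUIVALENT_FLAGS or name.startswith(NO_EQUIVALENT_PREFIX):
            unsupported.append(name)
        i += 1
    return env_flags, unsupported
```

A non-empty `unsupported` list is the signal that the surrounding doc passage needs a rewrite by its owner rather than a mechanical substitution.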

Files to update (owners in parens)

From #5028 (ray_run.py / marin.cluster.ray references)

Tutorials (docs team)

  • docs/explanations/executor.md:108
  • docs/tutorials/train-dpo.md:143
  • docs/tutorials/train-an-lm.md:152
  • docs/tutorials/tpu-cluster-setup.md:101-109 (may be deleted wholesale with the operator-tooling cleanup)

Recipes

  • docs/recipes/add_scaling_heuristic.md:45,179 (uses --cluster marin-us-central2 — needs owner input on target cluster)

Skills / planning

  • .agents/skills/ferries/SKILL.md:127,141
  • .agents/skills/architecture/SKILL.md:16,27
  • .agents/projects/ferry_framework.md:277

Runbooks / experiment READMEs

  • experiments/tootsie/BABYSITTING.md:15,62,69,76,84 (tootsie operators)
  • experiments/grug/README.md:43
  • experiments/README_sft.md:12,44,47

Docstring / header Usage: lines

  • experiments/tutorials/exp1077_reproduce_dclm_1b1x.py:14
  • experiments/tutorials/exp1078_reproduce_dclm_7b1x.py:14
  • experiments/rollout_data/{synthetic1,swe_rebench_openhands,principia,nemotron_terminal,gpt_oss_rollouts,superior_reasoning,coderforge}.py (7 files, all Usage: at ~line 7)
  • experiments/ferries/daily.py:14 (prose reference)

From #5031 (launch_on_ray / ray_tpu references)

Levanter docs

  • lib/levanter/docs/Getting-Started-TPU-VM.md — 5 references to launch_on_ray (feature description, caveats, usage example, deprecation note)
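The line numbers in the lists above will drift as files change, so it may help to regenerate them before sweeping. The following is a rough sketch, not an existing repo tool: `find_stale_references` and the `STALE` pattern are hypothetical names, and the pattern simply covers the identifiers named in this issue (ray_run.py, launch_on_ray, ray_tpu, marin.cluster.ray).

```python
import re
from pathlib import Path

# Identifiers named in this issue as deleted; extend as needed.
STALE = re.compile(r"ray_run\.py|launch_on_ray|ray_tpu|marin\.cluster\.ray")


def find_stale_references(root, suffixes=(".md", ".py")):
    """Return (relative_path, line_number, stripped_line) for each stale reference.

    Walks every file under `root` (including dot-directories such as .agents)
    and only inspects files whose suffix is in `suffixes`.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the repo root and printing each tuple as `path:line` gives a list in the same shape as the one above, which can be diffed against it to spot files this issue missed.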

Questions to resolve before sweeping

  1. Canonical one-shot launcher for executor-driven experiments in the Iris era — is it just python experiments/foo.py, assuming executor_main routes via fray → Iris?
  2. --no_wait equivalent for detached/long-running launches (ferries, tootsie).
  3. docs/tutorials/tpu-cluster-setup.md — rewrite or delete with the operator-tooling cleanup (scripts/ray/*, 18 × infra/marin-*.yaml)?
  4. lib/levanter/docs/Getting-Started-TPU-VM.md — rewrite the launch_on_ray sections to point at the fray/Iris TPU path, or delete them entirely if that workflow is deprecated?
