docs: retire Ray launcher references; route to Iris #5076
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 416fec329e
We can probably just delete this, since all the Tootsie runs are done.
Per Helw150's review comment on #5076: the tootsie runs are complete, so the runbook is dead content rather than something worth migrating to the Iris command syntax. Confirmed no incoming references (mkdocs nav, other docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sweep out `ray_run.py` / `scripts/ray/*` / `ray up` / `ray down` references from user-facing docs, agent skills, runbooks, and experiment docstrings, in preparation for the Ray cluster retirement (#4453). Replacements point at the live Iris launcher pattern documented in `experiments/ferries/OPS.md`:

    uv run iris --cluster=marin job run ...

Rewrites (preserve prior flags/run-selection semantics):
- docs/explanations/executor.md: Ray section rewritten as Fray/Iris
- docs/tutorials/train-an-lm.md, train-dpo.md: ray_run snippets → iris job run
- docs/recipes/add_scaling_heuristic.md: two ray_run commands → iris job run
- docs/tutorials/storage-bucket.md, local-gpu.md, first-experiment.md, executor-101.md, installation.md: prose references → Iris/Fray
- docs/explanations/{evaluation,experiments,guidelines,marin-prefix}.md, references/resource-config.md, harbor-integration.md: prose scrubs
- .agents/skills/ferries/SKILL.md: daily launch cmd + "Ray job id" labels
- .agents/skills/architecture/SKILL.md: entry point + infra references
- .agents/projects/ferry_framework.md: launch shape + run-record fields
- experiments/tootsie/BABYSITTING.md: five runbook snippets (propose-then-handoff; review requested from dlwh/Helw150/rjpower)
- experiments/grug/README.md, experiments/README_sft.md
- experiments/ferries/daily.py: prose docstring
- experiments/tutorials/exp1077_reproduce_dclm_1b1x.py, exp1078_reproduce_dclm_7b1x.py: docstring example
- experiments/rollout_data/*.py (7 files): identical `Usage:` docstring

Deletions (docs describing soon-retired code):
- docs/dev-guide/rebuilding-cluster.md: entirely about the `ray up` / `scripts/ray/cluster.py` rebuild flow
- docs/tutorials/tpu-cluster-setup.md: whole tutorial is "ray up / ray submit / ray dashboard"; removed from mkdocs.yml nav; installation.md link retargeted at lib/iris/OPS.md
- lib/levanter/docs/design/Ray-Job-Manager.md: design doc for the Ray TPU job manager (deleted in #5031)
- .agents/docs/fray-migration.md: past migration plan, superseded

Levanter `Getting-Started-TPU-VM.md`: removed the "Using the Ray Autoscaler" section (launch_on_ray.py was deleted in #5031), kept `launch.py` guidance.

Scope is doc-only. Not touched in this commit: `scripts/ray/*`, `lib/marin/src/marin/run/ray_run.py`, `infra/marin-*.yaml`, `infra/README.md`'s "Maintaining a Ray Cluster" section, and Ray references inside historical design/logbook docs (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`). Those belong to other retirement stages.

Refs: #4453 (parent), #5029 (doc sweep tracker)
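To make the launcher shape referenced in this commit concrete, here is a minimal Python sketch that assembles the argv for the `uv run iris --cluster=marin job run ...` pattern. The helper name `build_iris_command`, its parameters, and the module name `experiments.my_experiment` are hypothetical and exist only for illustration; the flags themselves (`--cluster`, `--no-wait`, `--memory`, and the `--` separator before `python -m`) are the ones quoted elsewhere in this thread. It does not invoke Iris and is not taken from the repository.

```python
import shlex


def build_iris_command(module, extra_args=(), cluster="marin", memory=None, no_wait=True):
    """Assemble an argv list for the Iris launcher pattern quoted in this PR.

    Hypothetical helper: it only builds the command line, it never runs it.
    """
    cmd = ["uv", "run", "iris", f"--cluster={cluster}", "job", "run"]
    if no_wait:
        cmd.append("--no-wait")
    if memory is not None:
        cmd.append(f"--memory={memory}")
    # Everything after "--" is the user command, forwarded verbatim.
    cmd += ["--", "python", "-m", module, *extra_args]
    return cmd


# "experiments.my_experiment" is a placeholder module name.
# Whether --force_run_failed takes a value is not shown in this thread.
command = build_iris_command(
    "experiments.my_experiment", ["--force_run_failed"], memory="2G"
)
print(shlex.join(command))
```

Building the command as a list rather than a string keeps the experiment's own flags clearly separated from the launcher's flags, which is the property the mechanical rewrites needed to preserve.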
iris job run rejects --memory >= 4 GB without --enable-extra-resources (lib/iris/src/iris/cli/job.py:432). The mechanical rewrite produced --memory=4G, which trips the validator; reduce to --memory=2G to match the canonical ferries pattern (experiments/ferries/OPS.md:17). Also correct the stale --memory 16g example in lib/iris/OPS.md:42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
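The rule described in this commit (requests at or above 4 GB are rejected unless extra resources are explicitly enabled) can be sketched as follows. This is an illustration of the described behavior only, not the code in `lib/iris/src/iris/cli/job.py`: the function names, the accepted size suffixes, and the choice to treat "4 GB" as 4 GiB are all assumptions.

```python
import re

# Assumption: binary units, so "4 GB" is treated as 4 GiB.
_UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
EXTRA_RESOURCES_THRESHOLD = 4 * 1024**3


def parse_memory(value: str) -> int:
    """Parse strings such as '2G', '16g', or '512MB' into a byte count."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMGT])I?B?", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized memory size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNIT_BYTES[unit])


def check_memory(value: str, enable_extra_resources: bool = False) -> int:
    """Return the requested bytes, or raise if the request needs the opt-in flag."""
    requested = parse_memory(value)
    if requested >= EXTRA_RESOURCES_THRESHOLD and not enable_extra_resources:
        raise ValueError(
            f"--memory={value} is >= 4 GB; pass --enable-extra-resources to allow it"
        )
    return requested


print(check_memory("2G"))  # accepted: below the threshold
try:
    check_memory("4G")  # rejected: at the threshold without the opt-in
except ValueError as err:
    print(err)
```

Under this sketch, `--memory=2G` passes while both `--memory=4G` and the stale `--memory 16g` example would fail without the opt-in flag, which matches the two fixes this commit makes.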
Force-pushed from f9fdbb5 to dd3dedb.
Removes 419 LOC of Ray-based job-submission CLI with zero in-repo callers. Doc and runbook references were swept in #5076; the canonical replacement is `uv run iris --cluster=marin job run --no-wait ...`. Part of the Ray-removal effort tracked in #4453.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

Deletes `lib/marin/src/marin/run/ray_run.py` (419 LOC). Zero in-repo code callers after the docs sweep in #5076. The canonical replacement is `uv run iris --cluster=marin job run --no-wait ...`. Part of the Ray-removal effort tracked in #4453 (stage 3h).

## Stacking

This PR is stacked on #5076 (stage 4 docs sweep). It temporarily shows stage-4 diffs in its "Files changed" tab; once #5076 merges into main, only the single deletion will remain. Retarget to `main` after #5076 lands.

## Test plan

- [x] `rg '\bray_run\b|marin\.run\.ray_run' lib/ experiments/ scripts/` returns zero hits.
- [x] `lib/marin/src/marin/run/__init__.py` does not re-export from `ray_run` (file is empty aside from the license header).
- [x] `uv run pyrefly` clean (via pre-commit).
- [x] `./infra/pre-commit.py --all-files --fix` clean.

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
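The first test-plan check relies on word boundaries so that only the bare identifier `ray_run` (or the dotted module path) counts as a hit. A small Python sketch of the same pattern shows which strings it would and would not flag. The sample lines and the `find_hits` helper are made up for illustration, and Python's `re` is used here as a stand-in for ripgrep's regex engine, which behaves the same way on these ASCII inputs.

```python
import re

# Same pattern as the rg invocation in the test plan above.
RAY_RUN = re.compile(r"\bray_run\b|marin\.run\.ray_run")


def find_hits(lines):
    """Return (line_number, line) pairs that still mention ray_run."""
    return [(n, line) for n, line in enumerate(lines, start=1) if RAY_RUN.search(line)]


# Hypothetical sample lines, not taken from the repository.
samples = [
    "python lib/marin/src/marin/run/ray_run.py -- python -m experiments.foo",
    "from marin.run.ray_run import main",
    "stray_runner = make_runner()",  # no hit: 'ray_run' is inside a longer word
    "uv run iris --cluster=marin job run --no-wait -- python -m experiments.foo",
]

for number, line in find_hits(samples):
    print(number, line)
```

Because `_` counts as a word character, identifiers such as `my_ray_run` or `stray_runner` are not flagged, which keeps the check from reporting unrelated names.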
## Summary

Stage 4 of the Ray-removal plan (#4453): sweep `ray_run.py` / `scripts/ray/*` / `ray up` / `ray down` references out of user-facing docs, agent skills, runbooks, and experiment docstrings, ahead of the production Ray cluster retirement. Replacements route readers to the live Iris launcher pattern documented in `experiments/ferries/OPS.md`: `uv run iris --cluster=marin job run ...`

Out of scope (other retirement stages): `scripts/ray/*` code, `infra/marin-*.yaml`, the `infra/README.md` "Maintaining a Ray Cluster" section, and `lib/marin/src/marin/run/ray_run.py` itself. Historical design/logbook docs (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`) are intentionally left alone because they document past state.

## What was rewritten

- `explanations/executor.md`, `explanations/{evaluation,experiments,guidelines,marin-prefix}.md`, `harbor-integration.md`, `references/resource-config.md`, `tutorials/{train-an-lm,train-dpo,storage-bucket,local-gpu,first-experiment,executor-101,installation}.md`, `recipes/add_scaling_heuristic.md`.
- `.agents/skills/ferries/SKILL.md` (daily launch command + "Ray job id" labels), `.agents/skills/architecture/SKILL.md` (entry point + infra prose), `.agents/projects/ferry_framework.md` (launch shape + run-record fields).
- `experiments/tootsie/BABYSITTING.md`: five launch snippets rewritten mechanically, preserving the `--force_run_failed` / `--run_only` flags; `experiments/grug/README.md`; `experiments/README_sft.md`.
- `experiments/ferries/daily.py`; `experiments/tutorials/exp1077_reproduce_dclm_1b1x.py`, `exp1078_reproduce_dclm_7b1x.py`; `experiments/rollout_data/*.py` (7 files with identical `Usage:` pattern).

## What was deleted

- `docs/dev-guide/rebuilding-cluster.md`: entirely about the `ray up` + `scripts/ray/cluster.py update-configs` rebuild flow.
- `docs/tutorials/tpu-cluster-setup.md`: whole tutorial is `ray up` / `ray submit` / `ray dashboard`; removed from `mkdocs.yml` nav; `installation.md` cross-link retargeted at `lib/iris/OPS.md`.
- `lib/levanter/docs/design/Ray-Job-Manager.md`: design doc for the Ray TPU job manager (implementation deleted in "Delete Levanter Ray TPU infra", #5031).
- `.agents/docs/fray-migration.md`: past migration plan, superseded.

## Levanter

`Getting-Started-TPU-VM.md`: removed the "Using the Ray Autoscaler" section (the code it described, `launch_on_ray.py`, was deleted in #5031). `launch.py` guidance kept; prose now points at `lib/iris/OPS.md` for Marin's shared-cluster path.

## Iris doc verification

Before rewriting, I audited `experiments/ferries/OPS.md` and `lib/iris/OPS.md` end-to-end against the live `iris` CLI. Every subcommand and flag cited in those docs resolved. Specifically verified:

Every flag, example, and cross-reference in the two Iris docs matched the live CLI output. Referenced scripts exist (`scripts/datakit/validate_ferry_outputs.py`, `.github/workflows/marin-datakit-smoke.yaml`) and `lib/iris/docs/priority-bands.md` exists. No Iris-doc corrections required.

## Test plan

- `./infra/pre-commit.py --all-files --fix` passes (ruff, black, license headers, pyrefly, markdown, yaml, etc.).
- `rg 'ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN' docs/ lib/levanter/docs/ .agents/ experiments/` returns zero hits outside the two historical logbook files (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`), verified via the `Grep` tool.
- `rg 'tpu-cluster-setup'` returns zero hits (mkdocs nav + installation.md cross-link updated).
- `rg 'Ray-Job-Manager|fray-migration'` returns zero hits (both deleted files unreferenced).
- `rg 'launch_on_ray'` returns zero hits in `lib/levanter/docs/`.

## Review handoff: tootsie runbook

@dlwh @Helw150 @rjpower: the `experiments/tootsie/BABYSITTING.md` rewrite in this PR is mechanical (preserves the `--force_run_failed` / `--run_only` flags verbatim, switches the launcher to `uv run iris --cluster=marin job run -- python -m experiments.exp{600,750}_tootsie...`). Please sanity-check the exact flags and cluster targeting; operators have right-of-refusal on the syntax. The doc also drops the `manual_ray_worker_launch.py` "reattach v4-2048" escape hatch and replaces it with the note that Iris handles preemption/restart itself. Speak up if that's not yet a safe claim for the big tootsie v4-2048 runs.

Refs: #4453 (parent Ray removal), #5029 (doc sweep tracker).
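For reviewers who want to re-run the stale-reference check from the test plan without ripgrep, the following is a rough Python equivalent. It is a sketch under stated assumptions: the `scan` helper is hypothetical, the allowlist matches the two historical logbook files by file name only, and unreadable or non-UTF-8 files are skipped silently, which may differ from ripgrep's own file filtering.

```python
import re
from pathlib import Path

# Same alternation as the rg pattern in the test plan.
STALE = re.compile(
    r"ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN"
)
# Assumption: the historical logbook files are identified by file name alone.
ALLOWED_NAMES = frozenset({"20251114_fray_design.md", "linear_ce_loss.md"})


def scan(roots, allowed=ALLOWED_NAMES):
    """Return (path, line_number, line) for each stale Ray reference found."""
    hits = []
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue  # tolerate roots that do not exist in this checkout
        for path in sorted(root_path.rglob("*")):
            if not path.is_file() or path.name in allowed:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                if STALE.search(line):
                    hits.append((str(path), number, line.strip()))
    return hits


for hit in scan(["docs", "lib/levanter/docs", ".agents", "experiments"]):
    print(*hit, sep=":")
```

An empty output corresponds to the "zero hits outside the two historical logbook files" result reported above.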