Commit 5f8625c
docs: retire Ray launcher references; route to Iris (#5076)
## Summary
Stage 4 of the Ray-removal plan (#4453): sweep `ray_run.py` /
`scripts/ray/*` / `ray up` / `ray down` references out of user-facing
docs, agent skills, runbooks, and experiment docstrings, ahead of the
production Ray cluster retirement. Replacements route readers to the
live Iris launcher pattern documented in `experiments/ferries/OPS.md`:
```
uv run iris --cluster=marin job run ...
```
Out of scope (other retirement stages): `scripts/ray/*` code,
`infra/marin-*.yaml`, `infra/README.md` "Maintaining a Ray Cluster"
section, and `lib/marin/src/marin/run/ray_run.py` itself. Historical
design/logbook docs (`.agents/projects/20251114_fray_design.md`,
`linear_ce_loss.md`) are intentionally left alone — they document past
state.
## What was rewritten
- **Docs (mkdocs)**: `explanations/executor.md`,
`explanations/{evaluation,experiments,guidelines,marin-prefix}.md`,
`harbor-integration.md`, `references/resource-config.md`,
`tutorials/{train-an-lm,train-dpo,storage-bucket,local-gpu,first-experiment,executor-101,installation}.md`,
`recipes/add_scaling_heuristic.md`.
- **Agent content**: `.agents/skills/ferries/SKILL.md` (daily launch
command + "Ray job id" labels), `.agents/skills/architecture/SKILL.md`
(entry point + infra prose), `.agents/projects/ferry_framework.md`
(launch shape + run-record fields).
- **Runbooks**: `experiments/tootsie/BABYSITTING.md` — five launch
snippets rewritten mechanically, preserving `--force_run_failed` /
`--run_only` flags; `experiments/grug/README.md`;
`experiments/README_sft.md`.
- **Module docstrings**: `experiments/ferries/daily.py`;
`experiments/tutorials/exp1077_reproduce_dclm_1b1x.py`,
`exp1078_reproduce_dclm_7b1x.py`; `experiments/rollout_data/*.py` (7
files with identical `Usage:` pattern).
## What was deleted
- `docs/dev-guide/rebuilding-cluster.md` — entirely about `ray up` +
`scripts/ray/cluster.py update-configs` rebuild flow.
- `docs/tutorials/tpu-cluster-setup.md` — whole tutorial is `ray up` /
`ray submit` / `ray dashboard`; removed from `mkdocs.yml` nav;
`installation.md` cross-link retargeted at `lib/iris/OPS.md`.
- `lib/levanter/docs/design/Ray-Job-Manager.md` — design doc for the Ray
TPU job manager (implementation deleted in #5031).
- `.agents/docs/fray-migration.md` — past migration plan, superseded.
Levanter `Getting-Started-TPU-VM.md`: removed the "Using the Ray
Autoscaler" section (the code it described, `launch_on_ray.py`, was
deleted in #5031). `launch.py` guidance kept; prose now points at
`lib/iris/OPS.md` for Marin's shared-cluster path.
## Iris doc verification
Before rewriting, I audited `experiments/ferries/OPS.md` and
`lib/iris/OPS.md` end-to-end against the live `iris` CLI. Every
subcommand and flag cited in those docs resolved. Specifically verified:
```
$ uv run iris --help # top-level commands
$ uv run iris job run --help # -e/--env-vars, --no-wait, --memory, --extra, etc.
$ uv run iris cluster --help # start, stop, restart, dashboard, dashboard-proxy, status, vm, controller, list, start-smoke
$ uv run iris cluster controller --help # checkpoint, restart, serve, worker-restart
$ uv run iris task exec --help # TASK_ID COMMAND..., --timeout
$ uv run iris process --help # status, logs, profile
$ uv run iris process profile --help # threads|cpu|mem, -t target, -d duration
$ uv run iris rpc controller --help # get-scheduler-state, get-autoscaler-status, etc.
$ uv run iris query --help # -f table|json|csv
$ uv run iris cluster vm status --help # --scale-group
$ uv run iris user budget --help # get, list, set
$ uv run iris cluster list # confirmed `marin` resolves to lib/iris/examples/marin.yaml
```
Every flag, example, and cross-reference in the two Iris docs matched
the live CLI output. Referenced scripts exist
(`scripts/datakit/validate_ferry_outputs.py`,
`.github/workflows/marin-datakit-smoke.yaml`) and
`lib/iris/docs/priority-bands.md` exists. **No Iris-doc corrections
required.**
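The `--help` probes listed above could be scripted roughly as follows. This is a minimal sketch, not the audit that was actually run: it assumes `uv` and the `iris` CLI are on the path, and the names `SUBCOMMANDS`, `help_argv`, and `failing_help_commands` are hypothetical helpers introduced here. It only checks that each subcommand resolves (exit code 0); comparing individual flags against the docs would still be a manual step.

```python
import subprocess

# Subcommand paths taken from the verification list above.
# Each one is probed with --help; the empty list is the top-level command.
SUBCOMMANDS = [
    [],
    ["job", "run"],
    ["cluster"],
    ["cluster", "controller"],
    ["task", "exec"],
    ["process"],
    ["process", "profile"],
    ["rpc", "controller"],
    ["query"],
    ["cluster", "vm", "status"],
    ["user", "budget"],
]


def help_argv(subcommand):
    """Build the argv for `uv run iris <subcommand...> --help`."""
    return ["uv", "run", "iris", *subcommand, "--help"]


def failing_help_commands(run=subprocess.run):
    """Return the argv lists whose --help invocation exits non-zero.

    `run` is injectable so the check can be exercised without a live CLI.
    """
    failures = []
    for subcommand in SUBCOMMANDS:
        argv = help_argv(subcommand)
        result = run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(argv)
    return failures
```

Calling `failing_help_commands()` in a checkout with the CLI installed would return an empty list if every subcommand resolves.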
## Test plan
- [x] `./infra/pre-commit.py --all-files --fix` passes (ruff, black,
license headers, pyrefly, markdown, yaml, etc.).
- [x] `rg 'ray_run|scripts/ray/|ray up|ray down|ray submit|ray job
submit|RAY_AUTH_TOKEN' docs/ lib/levanter/docs/ .agents/ experiments/`
returns zero hits outside the two historical logbook files
(`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`) —
verified via `Grep` tool.
- [x] `rg 'tpu-cluster-setup'` returns zero hits (mkdocs nav +
installation.md cross-link updated).
- [x] `rg 'Ray-Job-Manager|fray-migration'` returns zero hits (both
deleted files unreferenced).
- [x] `rg 'launch_on_ray'` returns zero hits in `lib/levanter/docs/`.
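For anyone who wants to re-run the first sweep without `rg`, a rough Python equivalent might look like the sketch below. The pattern and the two allowed filenames are copied from the test plan; the function name `find_retired_references` is hypothetical. Unlike `rg`, this sketch does not honor `.gitignore`, and it matches allowed files by basename only, so treat it as an approximation of the check rather than a replacement for it.

```python
import re
from pathlib import Path

# Patterns copied from the `rg` expression in the test plan.
RETIRED = re.compile(
    r"ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN"
)

# The two historical logbook files that are allowed to keep Ray references.
ALLOWED_NAMES = {"20251114_fray_design.md", "linear_ce_loss.md"}


def find_retired_references(roots):
    """Return (path, line_number, line) for every hit outside the allowed files."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or path.name in ALLOWED_NAMES:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                if RETIRED.search(line):
                    hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Directories taken from the test plan; missing ones are skipped.
    roots = ["docs", "lib/levanter/docs", ".agents", "experiments"]
    existing = [r for r in roots if Path(r).exists()]
    for path, number, line in find_retired_references(existing):
        print(f"{path}:{number}: {line}")
```

An empty output corresponds to the "zero hits outside the two historical logbook files" result recorded above.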
## Review handoff — tootsie runbook
@dlwh @Helw150 @rjpower — the `experiments/tootsie/BABYSITTING.md`
rewrite in this PR is mechanical (preserves `--force_run_failed` /
`--run_only` flags verbatim, switches launcher to `uv run iris
--cluster=marin job run -- python -m
experiments.exp{600,750}_tootsie...`). Please sanity-check the exact
flags and cluster targeting — operators have right-of-refusal on the
syntax. The doc also drops the `manual_ray_worker_launch.py` "reattach
v4-2048" escape hatch and replaces it with the note that Iris handles
preemption/restart itself. Speak up if that's not yet a safe claim for
the big tootsie v4-2048 runs.
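To make the launcher shape under review concrete, here is a small sketch that assembles the command line described above. The helper `iris_launch_argv` and the module name `experiments.example_module` are hypothetical placeholders, not names from the repository, and `--force_run_failed` is passed bare purely for illustration; the exact flag forms in `BABYSITTING.md` remain authoritative.

```python
import shlex


def iris_launch_argv(module, *flags, cluster="marin"):
    """Build the argv for launching `python -m <module>` through the Iris CLI."""
    return [
        "uv", "run", "iris", f"--cluster={cluster}", "job", "run",
        "--", "python", "-m", module, *flags,
    ]


# `experiments.example_module` is a hypothetical placeholder module name.
argv = iris_launch_argv("experiments.example_module", "--force_run_failed")
print(shlex.join(argv))
```

Keeping the experiment flags after the `--` separator mirrors the mechanical rewrite: only the launcher prefix changes, and the flags pass through untouched.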
Refs: #4453 (parent Ray removal), #5029 (doc sweep tracker).
---------
Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
1 parent 4256d06 commit 5f8625c
38 files changed
Lines changed: 131 additions & 840 deletions
File tree
- .agents
  - docs
  - projects
  - skills
    - architecture
    - ferries
- docs
  - dev-guide
  - explanations
  - recipes
  - references
  - tutorials
- experiments
  - ferries
  - grug
  - rollout_data
  - tootsie
  - tutorials
- lib
  - iris
  - levanter/docs
    - design