
docs: retire Ray launcher references; route to Iris#5076

Merged
yonromai merged 3 commits into main from 20260422-stage4-ray-docs
Apr 23, 2026


@yonromai
Contributor

Summary

Stage 4 of the Ray-removal plan (#4453): sweep ray_run.py / scripts/ray/* / ray up / ray down references out of user-facing docs, agent skills, runbooks, and experiment docstrings, ahead of the production Ray cluster retirement. Replacements route readers to the live Iris launcher pattern documented in experiments/ferries/OPS.md:

uv run iris --cluster=marin job run ...
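For scripted launches, the pattern above can be assembled as an argument list. This is a minimal sketch, assuming a hypothetical helper named `build_iris_command`; it only uses pieces that appear in this PR (`--cluster`, `--no-wait`, `--memory`, and the `--` separator before the job command) and does not invoke Iris itself. The ordering of the flags after `job run` is an assumption.

```python
def build_iris_command(module, *, cluster="marin", memory="2G", no_wait=True, extra_args=()):
    """Assemble an `iris job run` argv for a Python module (hypothetical helper)."""
    argv = ["uv", "run", "iris", f"--cluster={cluster}", "job", "run"]
    if no_wait:
        # Submit and return instead of blocking on the job.
        argv.append("--no-wait")
    argv.append(f"--memory={memory}")
    # Everything after `--` is the command the job executes.
    argv += ["--", "python", "-m", module, *extra_args]
    return argv
```

Passing the resulting list to `subprocess.run(argv, check=True)` would submit the job; printing `shlex.join(argv)` first is a cheap way to review the exact command before it runs.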

Out of scope (other retirement stages): scripts/ray/* code, infra/marin-*.yaml, infra/README.md "Maintaining a Ray Cluster" section, and lib/marin/src/marin/run/ray_run.py itself. Historical design/logbook docs (.agents/projects/20251114_fray_design.md, linear_ce_loss.md) are intentionally left alone — they document past state.

What was rewritten

  • Docs (mkdocs): explanations/executor.md, explanations/{evaluation,experiments,guidelines,marin-prefix}.md, harbor-integration.md, references/resource-config.md, tutorials/{train-an-lm,train-dpo,storage-bucket,local-gpu,first-experiment,executor-101,installation}.md, recipes/add_scaling_heuristic.md.
  • Agent content: .agents/skills/ferries/SKILL.md (daily launch command + "Ray job id" labels), .agents/skills/architecture/SKILL.md (entry point + infra prose), .agents/projects/ferry_framework.md (launch shape + run-record fields).
  • Runbooks: experiments/tootsie/BABYSITTING.md — five launch snippets rewritten mechanically, preserving --force_run_failed / --run_only flags; experiments/grug/README.md; experiments/README_sft.md.
  • Module docstrings: experiments/ferries/daily.py; experiments/tutorials/exp1077_reproduce_dclm_1b1x.py, exp1078_reproduce_dclm_7b1x.py; experiments/rollout_data/*.py (7 files with identical Usage: pattern).

What was deleted

  • docs/dev-guide/rebuilding-cluster.md — entirely about ray up + scripts/ray/cluster.py update-configs rebuild flow.
  • docs/tutorials/tpu-cluster-setup.md — whole tutorial is ray up / ray submit / ray dashboard; removed from mkdocs.yml nav; installation.md cross-link retargeted at lib/iris/OPS.md.
  • lib/levanter/docs/design/Ray-Job-Manager.md — design doc for the Ray TPU job manager (implementation deleted in Delete Levanter Ray TPU infra #5031).
  • .agents/docs/fray-migration.md — past migration plan, superseded.

Levanter Getting-Started-TPU-VM.md: removed the "Using the Ray Autoscaler" section (the code it described, launch_on_ray.py, was deleted in #5031). launch.py guidance kept; prose now points at lib/iris/OPS.md for Marin's shared-cluster path.

Iris doc verification

Before rewriting, I audited experiments/ferries/OPS.md and lib/iris/OPS.md end-to-end against the live iris CLI. Every subcommand and flag cited in those docs resolved. Specifically verified:

$ uv run iris --help            # top-level commands
$ uv run iris job run --help    # -e/--env-vars, --no-wait, --memory, --extra, etc.
$ uv run iris cluster --help    # start, stop, restart, dashboard, dashboard-proxy, status, vm, controller, list, start-smoke
$ uv run iris cluster controller --help   # checkpoint, restart, serve, worker-restart
$ uv run iris task exec --help  # TASK_ID COMMAND..., --timeout
$ uv run iris process --help    # status, logs, profile
$ uv run iris process profile --help  # threads|cpu|mem, -t target, -d duration
$ uv run iris rpc controller --help   # get-scheduler-state, get-autoscaler-status, etc.
$ uv run iris query --help      # -f table|json|csv
$ uv run iris cluster vm status --help   # --scale-group
$ uv run iris user budget --help         # get, list, set
$ uv run iris cluster list      # confirmed `marin` resolves to lib/iris/examples/marin.yaml

Every flag, example, and cross-reference in the two Iris docs matched the live CLI output. Referenced scripts exist (scripts/datakit/validate_ferry_outputs.py, .github/workflows/marin-datakit-smoke.yaml) and lib/iris/docs/priority-bands.md exists. No Iris-doc corrections required.
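An audit like this can be partly automated by checking captured `--help` output for the flags a doc cites. The sketch below is an assumption about how one might do that, not the procedure used in this PR; `missing_from_help` is a hypothetical helper, and it only checks that each name appears as a whole token in the help text.

```python
import re


def missing_from_help(help_text, expected):
    """Return the expected flags or subcommands that never appear as whole tokens."""
    missing = []
    for name in expected:
        # Reject matches embedded in a longer flag, e.g. `--no` inside `--no-wait`.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, help_text) is None:
            missing.append(name)
    return missing
```

The help text itself could come from `subprocess.run(["uv", "run", "iris", "job", "run", "--help"], capture_output=True, text=True).stdout`. A check like this catches renamed or removed flags but not changed semantics, so it complements rather than replaces reading the output.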

Test plan

  • ./infra/pre-commit.py --all-files --fix passes (ruff, black, license headers, pyrefly, markdown, yaml, etc.).
  • rg 'ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN' docs/ lib/levanter/docs/ .agents/ experiments/ returns zero hits outside the two historical logbook files (.agents/projects/20251114_fray_design.md, linear_ce_loss.md) — verified via Grep tool.
  • rg 'tpu-cluster-setup' returns zero hits (mkdocs nav + installation.md cross-link updated).
  • rg 'Ray-Job-Manager|fray-migration' returns zero hits (both deleted files unreferenced).
  • rg 'launch_on_ray' returns zero hits in lib/levanter/docs/.
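The first `rg` check above could be turned into a repeatable guard. The following is a minimal sketch under stated assumptions: `find_stale_refs` is a hypothetical helper, the regular expression mirrors the `rg` alternation from the test plan, and the allowlist is matched by file name only. Unlike `rg`, it does not honor `.gitignore`.

```python
import re
from pathlib import Path

# Same alternation as the rg pattern in the test plan.
STALE = re.compile(r"ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN")

# Historical logbook files that are intentionally left alone.
ALLOWED_NAMES = frozenset({"20251114_fray_design.md", "linear_ce_loss.md"})


def find_stale_refs(roots, allowed_names=ALLOWED_NAMES):
    """Return (path, line_number, line) for each stale Ray reference under roots."""
    hits = []
    for root in roots:
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file() or path.name in allowed_names:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if STALE.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_stale_refs(["docs", "lib/levanter/docs", ".agents", "experiments"])` from the repository root and failing when the result is non-empty would reproduce the zero-hits condition.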

Review handoff — tootsie runbook

@dlwh @Helw150 @rjpower — the experiments/tootsie/BABYSITTING.md rewrite in this PR is mechanical (preserves --force_run_failed / --run_only flags verbatim, switches launcher to uv run iris --cluster=marin job run -- python -m experiments.exp{600,750}_tootsie...). Please sanity-check the exact flags and cluster targeting — operators have right-of-refusal on the syntax. The doc also drops the manual_ray_worker_launch.py "reattach v4-2048" escape hatch and replaces it with the note that Iris handles preemption/restart itself. Speak up if that's not yet a safe claim for the big tootsie v4-2048 runs.

Refs: #4453 (parent Ray removal), #5029 (doc sweep tracker).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 416fec329e


Comment thread docs/tutorials/train-an-lm.md Outdated
@yonromai yonromai requested review from dlwh and rjpower April 22, 2026 21:15
@yonromai
Contributor Author

🤖 Pushed 7fdce509 to address the P1 validator comment. --memory=4G → --memory=2G across all 18 stage-4 rewrites (matches the canonical experiments/ferries/OPS.md:17 pattern), plus lib/iris/OPS.md:42 corrected from --memory 16g → --memory 2g with a note about the 4 GB opt-in threshold.
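To illustrate the rule behind this fix: per the commit message, `iris job run` rejects `--memory` values of 4 GB or more unless `--enable-extra-resources` is passed. The sketch below is not the code at lib/iris/src/iris/cli/job.py; it is an illustrative reimplementation with hypothetical names (`parse_memory`, `check_memory`), and it assumes binary units and case-insensitive suffixes.

```python
import re

GIB = 1024**3
# Assumed threshold: requests at or above 4 GB need an explicit opt-in.
OPT_IN_THRESHOLD_BYTES = 4 * GIB
_UNITS = {"k": 1024, "m": 1024**2, "g": GIB, "t": 1024**4}


def parse_memory(value):
    """Parse strings like '2G', '16g', or '512M' into bytes (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmgt])b?\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized memory size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.lower()])


def check_memory(value, enable_extra_resources=False):
    """Return the request in bytes, or raise if a large request lacks the opt-in."""
    requested = parse_memory(value)
    if requested >= OPT_IN_THRESHOLD_BYTES and not enable_extra_resources:
        raise ValueError(f"--memory={value} is at or above 4 GB; pass --enable-extra-resources to opt in")
    return requested
```

Under these assumptions `check_memory("2G")` passes, while `check_memory("4G")` raises, which matches why the mechanical rewrite to `--memory=4G` tripped the validator and `--memory=2G` did not.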

@yonromai yonromai mentioned this pull request Apr 22, 2026
4 tasks
Comment thread experiments/tootsie/BABYSITTING.md Outdated
Member


We can probably just delete this since all the Tootsie runs are done.

yonromai added a commit that referenced this pull request Apr 22, 2026
Per Helw150's review comment on #5076: the tootsie runs are complete,
so the runbook is dead content rather than something worth migrating
to the Iris command syntax. Confirmed no incoming references (mkdocs
nav, other docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yonromai and others added 3 commits April 22, 2026 19:03
Sweep out `ray_run.py` / `scripts/ray/*` / `ray up` / `ray down` references
from user-facing docs, agent skills, runbooks, and experiment docstrings, in
preparation for the Ray cluster retirement (#4453). Replacements point at the
live Iris launcher pattern documented in `experiments/ferries/OPS.md`:

    uv run iris --cluster=marin job run ...

Rewrites (preserve prior flags/run-selection semantics):
- docs/explanations/executor.md — Ray section rewritten as Fray/Iris
- docs/tutorials/train-an-lm.md, train-dpo.md — ray_run snippets → iris job run
- docs/recipes/add_scaling_heuristic.md — two ray_run commands → iris job run
- docs/tutorials/storage-bucket.md, local-gpu.md, first-experiment.md,
  executor-101.md, installation.md — prose references → Iris/Fray
- docs/explanations/{evaluation,experiments,guidelines,marin-prefix}.md,
  references/resource-config.md, harbor-integration.md — prose scrubs
- .agents/skills/ferries/SKILL.md — daily launch cmd + "Ray job id" labels
- .agents/skills/architecture/SKILL.md — entry point + infra references
- .agents/projects/ferry_framework.md — launch shape + run-record fields
- experiments/tootsie/BABYSITTING.md — five runbook snippets
  (propose-then-handoff; review requested from dlwh/Helw150/rjpower)
- experiments/grug/README.md, experiments/README_sft.md
- experiments/ferries/daily.py — prose docstring
- experiments/tutorials/exp1077_reproduce_dclm_1b1x.py,
  exp1078_reproduce_dclm_7b1x.py — docstring example
- experiments/rollout_data/*.py (7 files) — identical `Usage:` docstring

Deletions (docs describing soon-retired code):
- docs/dev-guide/rebuilding-cluster.md — entirely about `ray up`/
  `scripts/ray/cluster.py` rebuild flow
- docs/tutorials/tpu-cluster-setup.md — whole tutorial is "ray up / ray submit /
  ray dashboard"; removed from mkdocs.yml nav; installation.md link retargeted
  at lib/iris/OPS.md
- lib/levanter/docs/design/Ray-Job-Manager.md — design doc for the Ray TPU
  job manager (deleted in #5031)
- .agents/docs/fray-migration.md — past migration plan, superseded

Levanter `Getting-Started-TPU-VM.md`: removed the "Using the Ray Autoscaler"
section (launch_on_ray.py was deleted in #5031), kept `launch.py` guidance.

Scope is doc-only. Not touched in this commit: `scripts/ray/*`,
`lib/marin/src/marin/run/ray_run.py`, `infra/marin-*.yaml`,
`infra/README.md`'s "Maintaining a Ray Cluster" section, and Ray references
inside historical design/logbook docs (`.agents/projects/20251114_fray_design.md`,
`linear_ce_loss.md`). Those belong to other retirement stages.

Refs: #4453 (parent), #5029 (doc sweep tracker)
iris job run rejects --memory >= 4 GB without --enable-extra-resources
(lib/iris/src/iris/cli/job.py:432). The mechanical rewrite produced
--memory=4G, which trips the validator; reduce to --memory=2G to match
the canonical ferries pattern (experiments/ferries/OPS.md:17). Also
correct the stale --memory 16g example in lib/iris/OPS.md:42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yonromai yonromai force-pushed the 20260422-stage4-ray-docs branch from f9fdbb5 to dd3dedb Compare April 23, 2026 02:03
yonromai added a commit that referenced this pull request Apr 23, 2026
Removes 419 LOC of Ray-based job-submission CLI with zero in-repo
callers. Doc and runbook references were swept in #5076; the canonical
replacement is `uv run iris --cluster=marin job run --no-wait ...`.

Part of the Ray-removal effort tracked in #4453.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yonromai yonromai enabled auto-merge (squash) April 23, 2026 02:06
@yonromai yonromai merged commit 5f8625c into main Apr 23, 2026
39 checks passed
@yonromai yonromai deleted the 20260422-stage4-ray-docs branch April 23, 2026 02:13
yonromai added a commit that referenced this pull request Apr 23, 2026
## Summary

Deletes `lib/marin/src/marin/run/ray_run.py` (419 LOC). Zero in-repo
code callers after the docs sweep in #5076. The canonical replacement is
`uv run iris --cluster=marin job run --no-wait ...`.

Part of the Ray-removal effort tracked in #4453 (stage 3h).

## Stacking

This PR is stacked on #5076 (stage 4 docs sweep). It temporarily shows
stage-4 diffs in its "Files changed" tab; once #5076 merges into main,
only the single deletion will remain. Retarget to `main` after #5076
lands.

## Test plan

- [x] `rg '\bray_run\b|marin\.run\.ray_run' lib/ experiments/ scripts/`
returns zero hits.
- [x] `lib/marin/src/marin/run/__init__.py` does not re-export from
`ray_run` (file is empty aside from the license header).
- [x] `uv run pyrefly` clean (via pre-commit).
- [x] `./infra/pre-commit.py --all-files --fix` clean.

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>

Labels

agent-generated Created by automation/agent
