Commit 5f8625c

docs: retire Ray launcher references; route to Iris (#5076)
## Summary

Stage 4 of the Ray-removal plan (#4453): sweep `ray_run.py` / `scripts/ray/*` / `ray up` / `ray down` references out of user-facing docs, agent skills, runbooks, and experiment docstrings, ahead of the production Ray cluster retirement.

Replacements route readers to the live Iris launcher pattern documented in `experiments/ferries/OPS.md`:

```
uv run iris --cluster=marin job run ...
```

Out of scope (other retirement stages): `scripts/ray/*` code, `infra/marin-*.yaml`, `infra/README.md` "Maintaining a Ray Cluster" section, and `lib/marin/src/marin/run/ray_run.py` itself. Historical design/logbook docs (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`) are intentionally left alone because they document past state.

## What was rewritten

- **Docs (mkdocs)**: `explanations/executor.md`, `explanations/{evaluation,experiments,guidelines,marin-prefix}.md`, `harbor-integration.md`, `references/resource-config.md`, `tutorials/{train-an-lm,train-dpo,storage-bucket,local-gpu,first-experiment,executor-101,installation}.md`, `recipes/add_scaling_heuristic.md`.
- **Agent content**: `.agents/skills/ferries/SKILL.md` (daily launch command + "Ray job id" labels), `.agents/skills/architecture/SKILL.md` (entry point + infra prose), `.agents/projects/ferry_framework.md` (launch shape + run-record fields).
- **Runbooks**: `experiments/tootsie/BABYSITTING.md` (five launch snippets rewritten mechanically, preserving `--force_run_failed` / `--run_only` flags); `experiments/grug/README.md`; `experiments/README_sft.md`.
- **Module docstrings**: `experiments/ferries/daily.py`; `experiments/tutorials/exp1077_reproduce_dclm_1b1x.py`, `exp1078_reproduce_dclm_7b1x.py`; `experiments/rollout_data/*.py` (7 files with identical `Usage:` pattern).

## What was deleted

- `docs/dev-guide/rebuilding-cluster.md`: entirely about the `ray up` + `scripts/ray/cluster.py update-configs` rebuild flow.
- `docs/tutorials/tpu-cluster-setup.md`: the whole tutorial is `ray up` / `ray submit` / `ray dashboard`; removed from `mkdocs.yml` nav; `installation.md` cross-link retargeted at `lib/iris/OPS.md`.
- `lib/levanter/docs/design/Ray-Job-Manager.md`: design doc for the Ray TPU job manager (implementation deleted in #5031).
- `.agents/docs/fray-migration.md`: past migration plan, superseded.

Levanter `Getting-Started-TPU-VM.md`: removed the "Using the Ray Autoscaler" section (the code it described, `launch_on_ray.py`, was deleted in #5031). `launch.py` guidance kept; prose now points at `lib/iris/OPS.md` for Marin's shared-cluster path.

## Iris doc verification

Before rewriting, I audited `experiments/ferries/OPS.md` and `lib/iris/OPS.md` end-to-end against the live `iris` CLI. Every subcommand and flag cited in those docs resolved. Specifically verified:

```
$ uv run iris --help                     # top-level commands
$ uv run iris job run --help             # -e/--env-vars, --no-wait, --memory, --extra, etc.
$ uv run iris cluster --help             # start, stop, restart, dashboard, dashboard-proxy, status, vm, controller, list, start-smoke
$ uv run iris cluster controller --help  # checkpoint, restart, serve, worker-restart
$ uv run iris task exec --help           # TASK_ID COMMAND..., --timeout
$ uv run iris process --help             # status, logs, profile
$ uv run iris process profile --help     # threads|cpu|mem, -t target, -d duration
$ uv run iris rpc controller --help      # get-scheduler-state, get-autoscaler-status, etc.
$ uv run iris query --help               # -f table|json|csv
$ uv run iris cluster vm status --help   # --scale-group
$ uv run iris user budget --help         # get, list, set
$ uv run iris cluster list               # confirmed `marin` resolves to lib/iris/examples/marin.yaml
```

Every flag, example, and cross-reference in the two Iris docs matched the live CLI output. Referenced scripts exist (`scripts/datakit/validate_ferry_outputs.py`, `.github/workflows/marin-datakit-smoke.yaml`) and `lib/iris/docs/priority-bands.md` exists.

**No Iris-doc corrections required.**

## Test plan

- [x] `./infra/pre-commit.py --all-files --fix` passes (ruff, black, license headers, pyrefly, markdown, yaml, etc.).
- [x] `rg 'ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN' docs/ lib/levanter/docs/ .agents/ experiments/` returns zero hits outside the two historical logbook files (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`); verified via `Grep` tool.
- [x] `rg 'tpu-cluster-setup'` returns zero hits (mkdocs nav + installation.md cross-link updated).
- [x] `rg 'Ray-Job-Manager|fray-migration'` returns zero hits (both deleted files unreferenced).
- [x] `rg 'launch_on_ray'` returns zero hits in `lib/levanter/docs/`.

## Review handoff: tootsie runbook

@dlwh @Helw150 @rjpower: the `experiments/tootsie/BABYSITTING.md` rewrite in this PR is mechanical (preserves `--force_run_failed` / `--run_only` flags verbatim, switches launcher to `uv run iris --cluster=marin job run -- python -m experiments.exp{600,750}_tootsie...`). Please sanity-check the exact flags and cluster targeting; operators have right-of-refusal on the syntax. The doc also drops the `manual_ray_worker_launch.py` "reattach v4-2048" escape hatch and replaces it with the note that Iris handles preemption/restart itself. Speak up if that's not yet a safe claim for the big tootsie v4-2048 runs.

Refs: #4453 (parent Ray removal), #5029 (doc sweep tracker).

---------

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
1 parent 4256d06 commit 5f8625c
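
The residual-reference check in the test plan can be approximated in Python. This is a minimal sketch, not the command the author ran: the regular expression is copied from the `rg` invocation in the commit message, while the function name `find_retired_references`, the in-memory `files` mapping, and the sample paths and contents are hypothetical stand-ins for walking the repository.

```python
import re

# Pattern copied from the commit's test plan (no word boundaries, like the rg call).
RETIRED = re.compile(
    r"ray_run|scripts/ray/|ray up|ray down|ray submit|ray job submit|RAY_AUTH_TOKEN"
)

# Historical logbook files the commit intentionally leaves alone.
HISTORICAL = ("20251114_fray_design.md", "linear_ce_loss.md")


def find_retired_references(files):
    """Return (path, line_number, line) for each hit outside historical files.

    `files` maps a path to its text; it is a hypothetical in-memory stand-in
    for reading the docs directories from disk.
    """
    hits = []
    for path, text in sorted(files.items()):
        if path.endswith(HISTORICAL):
            continue  # past-state documents are exempt from the sweep
        for number, line in enumerate(text.splitlines(), start=1):
            if RETIRED.search(line):
                hits.append((path, number, line.strip()))
    return hits


# Made-up sample contents, for illustration only.
sample = {
    "docs/current.md": "uv run iris --cluster=marin job run ...\n",
    "docs/stale.md": "Run `ray up cluster.yaml` first.\n",
    ".agents/projects/20251114_fray_design.md": "ray_run.py was the launcher.\n",
}
hits = find_retired_references(sample)
print(hits)
```

Because the pattern has no word boundaries, a string such as "array up" would also match; that mirrors the behavior of the original `rg` expression rather than improving on it.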

38 files changed

Lines changed: 131 additions & 840 deletions

.agents/docs/fray-migration.md

Lines changed: 0 additions & 167 deletions
This file was deleted.

.agents/projects/ferry_framework.md

Lines changed: 5 additions & 7 deletions
@@ -150,7 +150,7 @@ Every ferry launch/update should include a minimal run record with these fields:
 - `script`
 - `git_sha`
 - `cluster`
-- `ray_job_id`
+- `iris_job_id`
 - `wandb_run_id`
 - `wandb_url`
 - `start_time`
@@ -179,7 +179,7 @@ Rules:
 - handoff is allowed only with an explicit replacement owner and a state handoff containing:
 - current job status
 - latest error/signals
-- last known Ray and W&B links
+- last known Iris and W&B links
 
 ### Canary Freeze Policy

@@ -274,10 +274,8 @@ gh issue list \
 Launch shape (illustrative, to pin in recipe):
 
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
---no_wait \
---cluster us-central1 \
--- python experiments/ferries/daily.py --run_name "daily-125m-$(date +%F)"
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=2G --extra=cpu \
+-- python -m experiments.ferries.daily --run_name "daily-125m-$(date +%F)"
 ```
 
 Monitoring handoff:
@@ -344,7 +342,7 @@ Phase-2:
 ## Resolved Decisions
 
 1. Ferry run closure uses a log-only PR (`docs/experiments/daily-ferry-log.md`); proposal/debug details live in issues.
-2. Default cluster for now is `us-central1` (Ray CLI cluster key; maps to zone `us-central1-a`).
+2. Default cluster for now is `marin` (Iris `--cluster` key, resolves to `lib/iris/examples/marin.yaml`).
 3. "Experiment-relevant issues" filter starts with label `experiment` only.
 4. "Max 2 knobs changed" remains policy guidance, not script-enforced.
 5. Discord automation is deferred to Phase 2.
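
The run-record fields visible in the first hunk of this file can be modeled as a small typed record. This is a sketch under stated assumptions: only the seven fields shown in the hunk are included (the full list in `ferry_framework.md` may be longer), and the class name `FerryRunRecord`, the helper `missing_fields`, and every example value are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FerryRunRecord:
    """Subset of the run-record fields shown in the diff hunk (assumed partial)."""

    script: str
    git_sha: str
    cluster: str
    iris_job_id: str  # replaces the retired `ray_job_id` field
    wandb_run_id: str
    wandb_url: str
    start_time: str


def missing_fields(record):
    """List the record fields that are absent or empty in a plain dict."""
    return [f.name for f in fields(FerryRunRecord) if not record.get(f.name)]


# Hypothetical values, for illustration only.
legacy = {
    "script": "experiments/ferries/daily.py",
    "git_sha": "5f8625c",
    "cluster": "marin",
    "ray_job_id": "old-style-id",
    "wandb_run_id": "abc123",
    "wandb_url": "https://example.invalid/run/abc123",
    "start_time": "2025-01-01T00:00:00Z",
}
print(missing_fields(legacy))
```

A record that still carries `ray_job_id` reports `iris_job_id` as missing, which is one way a log-checking script could flag entries written before the rename.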

.agents/skills/architecture/SKILL.md

Lines changed: 7 additions & 7 deletions
@@ -5,15 +5,15 @@ description: Marin architecture overview and repository structure reference. Use
 
 # Skill: Marin Architecture
 
-Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using Ray for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.
+Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using [Fray](https://github.com/marin-community/marin/tree/main/lib/fray) (dispatched onto [Iris](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) on shared clusters) for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.
 
 ## Core Architecture
 
 **Executor Pattern**: Experiments are DAGs of `ExecutorStep` objects (`lib/marin/src/marin/execution/executor.py`). Output path = `<base>/<name>-<hash>` where hash covers versioned fields and dependencies. Only changed steps re-run.
 
-**Ray Distribution**: Steps can be normal or `@ray.remote` functions. Ray ships code to workers with step-specific dependency groups from `pyproject.toml`.
+**Fray/Iris Distribution**: Steps that need remote execution wrap their function with `remote()` (see `experiments/defaults.py`). Fray launches each remote step as a sub-job against the current cluster (Iris on shared infra, Local for laptop runs). Step-specific dependency groups are drawn from `pyproject.toml`.
 
-**Entry Point**: `executor_main()` or [`lib/marin/src/marin/run/ray_run.py`](https://github.com/marin-community/marin/blob/main/lib/marin/src/marin/run/ray_run.py) for cluster execution.
+**Entry Point**: Call `executor_main()` at the bottom of the script; launch the script itself as a CPU-only Iris job (`uv run iris --cluster=marin job run -- python -m experiments.<script>`) for cluster execution. See [`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) for the full launch reference.
 
 ## Repository Structure

@@ -24,7 +24,7 @@ marin/
 
 ├── lib/marin/src/marin/ # Core library organized by function
 │ ├── execution/ # DAG executor (executor.py, status_actor.py)
-│ ├── run/ # Job launchers (ray_run.py, slurm_run.py)
+│ ├── run/ # Legacy launcher stubs (slurm_run.py); submit via `iris job run` on shared clusters
 │ ├── download/ # Dataset downloaders (huggingface/, ar5iv/, wikipedia/, nemotron_cc/, filesystem/)
 │ ├── transform/ # Raw data → text (ar5iv/, stackexchange/, wikipedia/, conversation/, domain-specific)
 │ ├── crawl/ # Web crawling (fetch_links.py, minhash/, fineweb_edu/, open_web_math/)
@@ -55,8 +55,8 @@ marin/
 │ └── quickstart-data/
 
 ├── docs/ # Documentation (tutorials/, explanations/, references/, recipes/, reports/, design/, dev-guide/, model-cards/)
-├── infra/ # Ray cluster configs (marin-*.yaml, configure_gcp_registry.py)
-├── scripts/ # Utilities (ray/, training/, pm/, debug/, gpu_eval/)
+├── infra/ # Cluster configs (configure_gcp_registry.py, configure_temp_buckets.py). Iris cluster configs live under lib/iris/examples/.
+├── scripts/ # Utilities (iris/, training/, pm/, debug/, gpu_eval/)
 └── docker/ # Docker configs (marin/, levanter/)
 ```

@@ -72,7 +72,7 @@ marin/
 5. **Train** (`lib/marin/src/marin/training/`): Levanter (JAX) on TPU/GPU
 6. **Evaluate** (`lib/marin/src/marin/evaluation/`): lm-eval-harness or vLLM
 
-**Cluster Infrastructure** (`infra/README.md`): Ray on GCP, on-demand head + preemptible TPU workers (v4/v5e/v6e), autoscaling 4-1024 workers, managed via `scripts/ray/cluster.py`
+**Cluster Infrastructure** ([`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md)): Iris on GCP (TPU v4/v5e/v6e) and CoreWeave (H100 GPUs); on-demand controller + autoscaling preemptible workers. Submit jobs with `uv run iris --cluster=marin job run ...`.
 
 **Default Helpers** (`experiments/defaults.py`): `default_download()`, `default_tokenize()`, `default_train()`, `default_eval()`
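
The executor rule quoted in this file (output path = `<base>/<name>-<hash>`, with the hash covering versioned fields and dependencies) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from `executor.py`: the function name `step_output_path`, JSON serialization, SHA-256, the six-character truncation, sorting of dependency paths, and the `gs://example-bucket` prefix are all hypothetical.

```python
import hashlib
import json


def step_output_path(base, name, versioned_fields, dependency_paths):
    """Build `<base>/<name>-<hash>` from a step's versioned config and its inputs.

    Hypothetical hashing scheme: the real executor may serialize and hash
    differently. The point is only that the digest depends on both the
    step's own versioned fields and the output paths of its dependencies.
    """
    payload = json.dumps(
        {"fields": versioned_fields, "deps": sorted(dependency_paths)},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:6]
    return f"{base}/{name}-{digest}"


BASE = "gs://example-bucket"  # made-up prefix

tokenize = step_output_path(BASE, "tokenize", {"tokenizer": "a"}, [])
train = step_output_path(BASE, "train", {"lr": 0.001}, [tokenize])

# Changing an upstream versioned field changes the upstream path, and the
# downstream hash changes with it because it covers its dependencies.
tokenize_v2 = step_output_path(BASE, "tokenize", {"tokenizer": "b"}, [])
train_v2 = step_output_path(BASE, "train", {"lr": 0.001}, [tokenize_v2])

print(tokenize != tokenize_v2, train != train_v2)
```

Under this scheme an unchanged step maps to the same path on every run, which is what would let an executor skip it, while an edit to an upstream step propagates to everything that depends on it.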

.agents/skills/ferries/SKILL.md

Lines changed: 9 additions & 13 deletions
@@ -39,7 +39,7 @@ Collect:
 1. Last ferry references:
 - issue URL
 - PR/commit URL
-- W&B run URL and Ray job ID
+- W&B run URL and Iris job ID
 2. Human objective for this interval:
 - standard integration pass
 - or explicit regression investigation
@@ -124,25 +124,21 @@ Then push the launch commit (no proposal PR by default).
 Before launch, confirm requester approval in-thread unless they already gave explicit "launch without asking" permission.
 
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
---no_wait \
---cluster us-central1 \
--- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=2G --extra=cpu \
+-- python -m experiments.ferries.daily
 ```
 
 After launch, capture and post to the issue:
-- Ray job id
+- Iris job id (printed by `iris job run`, form `/<user>/iris-run-job-YYYYMMDD-HHMMSS`)
 - cluster
 - launch timestamp
 - W&B link(s) when available
 
 Optional deterministic daily rerun name:
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
---no_wait \
---cluster us-central1 \
--e FERRY_DATE="$(date +%Y%m%d-%H%M%S)-daily-ferry" \
--- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=2G --extra=cpu \
+-e FERRY_DATE "$(date +%Y%m%d-%H%M%S)-daily-ferry" \
+-- python -m experiments.ferries.daily
 ```
 
 #### 5) Monitor to terminal state
@@ -158,7 +154,7 @@ Follow the **babysit-job** skill with:
 Post in the ferry issue:
 - final status,
 - key metrics/regressions,
-- Ray job ID and W&B link(s),
+- Iris job ID and W&B link(s),
 - recommendation for next ferry.
 - Optional: post a manual Discord update for major run state changes.

@@ -174,7 +170,7 @@ Required terminal issue comment template:
 
 ```markdown
 Final status: <SUCCEEDED|FAILED|STOPPED>
-Ray job id: <job_id>
+Iris job id: <job_id>
 W&B link: <url>
 Final eval summary: <short summary + key metrics>
 Experiment link: <experiment JSON/browser link>
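
The job id form given in this file, `/<user>/iris-run-job-YYYYMMDD-HHMMSS`, is regular enough to validate before posting it to an issue. The following is a minimal sketch that assumes ids printed by `iris job run` always follow that form; the regex, the function name `parse_iris_job_id`, and the example id are hypothetical and not part of the Iris CLI.

```python
import re
from datetime import datetime

# Mirrors the documented form `/<user>/iris-run-job-YYYYMMDD-HHMMSS`.
JOB_ID = re.compile(r"^/(?P<user>[^/]+)/iris-run-job-(?P<stamp>\d{8}-\d{6})$")


def parse_iris_job_id(job_id):
    """Split a job id into (user, launch datetime); raise ValueError if malformed."""
    match = JOB_ID.match(job_id)
    if match is None:
        raise ValueError(f"not an iris-run-job id: {job_id!r}")
    # strptime also rejects impossible dates such as month 13.
    launched = datetime.strptime(match.group("stamp"), "%Y%m%d-%H%M%S")
    return match.group("user"), launched


user, launched = parse_iris_job_id("/alice/iris-run-job-20250102-030405")
print(user, launched.isoformat())
```

A check like this could guard the "Iris job id" line of the terminal comment template against a pasted W&B id or a leftover Ray job id, though jobs submitted with a custom name may not match the pattern.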
