Commit 416fec3

docs: retire Ray launcher references; route to Iris
Sweep out `ray_run.py` / `scripts/ray/*` / `ray up` / `ray down` references from user-facing docs, agent skills, runbooks, and experiment docstrings, in preparation for the Ray cluster retirement (#4453).

Replacements point at the live Iris launcher pattern documented in `experiments/ferries/OPS.md`:

    uv run iris --cluster=marin job run ...

Rewrites (preserve prior flags/run-selection semantics):
- docs/explanations/executor.md: Ray section rewritten as Fray/Iris
- docs/tutorials/train-an-lm.md, train-dpo.md: ray_run snippets → iris job run
- docs/recipes/add_scaling_heuristic.md: two ray_run commands → iris job run
- docs/tutorials/storage-bucket.md, local-gpu.md, first-experiment.md, executor-101.md, installation.md: prose references → Iris/Fray
- docs/explanations/{evaluation,experiments,guidelines,marin-prefix}.md, references/resource-config.md, harbor-integration.md: prose scrubs
- .agents/skills/ferries/SKILL.md: daily launch cmd + "Ray job id" labels
- .agents/skills/architecture/SKILL.md: entry point + infra references
- .agents/projects/ferry_framework.md: launch shape + run-record fields
- experiments/tootsie/BABYSITTING.md: five runbook snippets (propose-then-handoff; review requested from dlwh/Helw150/rjpower)
- experiments/grug/README.md, experiments/README_sft.md
- experiments/ferries/daily.py: prose docstring
- experiments/tutorials/exp1077_reproduce_dclm_1b1x.py, exp1078_reproduce_dclm_7b1x.py: docstring example
- experiments/rollout_data/*.py (7 files): identical `Usage:` docstring

Deletions (docs describing soon-retired code):
- docs/dev-guide/rebuilding-cluster.md: entirely about `ray up`/`scripts/ray/cluster.py` rebuild flow
- docs/tutorials/tpu-cluster-setup.md: whole tutorial is "ray up / ray submit / ray dashboard"; removed from mkdocs.yml nav; installation.md link retargeted at lib/iris/OPS.md
- lib/levanter/docs/design/Ray-Job-Manager.md: design doc for the Ray TPU job manager (deleted in #5031)
- .agents/docs/fray-migration.md: past migration plan, superseded

Levanter `Getting-Started-TPU-VM.md`: removed the "Using the Ray Autoscaler" section (launch_on_ray.py was deleted in #5031), kept `launch.py` guidance.

Scope is doc-only. Not touched in this commit: `scripts/ray/*`, `lib/marin/src/marin/run/ray_run.py`, `infra/marin-*.yaml`, `infra/README.md`'s "Maintaining a Ray Cluster" section, and Ray references inside historical design/logbook docs (`.agents/projects/20251114_fray_design.md`, `linear_ce_loss.md`). Those belong to other retirement stages.

Refs: #4453 (parent), #5029 (doc sweep tracker)
1 parent 75d22e5 commit 416fec3
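A sweep like the one described in the commit message can be checked mechanically. The following is a minimal sketch, not part of the commit: the regular expressions are an assumed encoding of the four reference styles the message names (`ray_run.py`, `scripts/ray/*`, `ray up`, `ray down`), and `find_ray_references` and `scan_tree` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed regex encoding of the reference styles named in the commit message.
RAY_PATTERNS = [
    re.compile(r"ray_run\.py"),
    re.compile(r"scripts/ray/"),
    re.compile(r"\bray (?:up|down)\b"),
]


def find_ray_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line matching a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in RAY_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits


def scan_tree(root: str, suffixes: tuple[str, ...] = (".md", ".py")) -> dict[str, list[tuple[int, str]]]:
    """Scan files under root and map each path with hits to its matching lines."""
    results = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = find_ray_references(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                results[str(path)] = hits
    return results
```

Running `scan_tree("docs")` after the sweep should come back empty for the user-facing docs. The paths the commit deliberately leaves untouched (for example `scripts/ray/*` and the historical design docs) would still match and would need to be excluded by the caller.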

37 files changed

Lines changed: 175 additions & 794 deletions

.agents/docs/fray-migration.md

Lines changed: 0 additions & 167 deletions
This file was deleted.

.agents/projects/ferry_framework.md

Lines changed: 5 additions & 7 deletions
@@ -150,7 +150,7 @@ Every ferry launch/update should include a minimal run record with these fields:
 - `script`
 - `git_sha`
 - `cluster`
-- `ray_job_id`
+- `iris_job_id`
 - `wandb_run_id`
 - `wandb_url`
 - `start_time`
@@ -179,7 +179,7 @@ Rules:
 - handoff is allowed only with an explicit replacement owner and a state handoff containing:
   - current job status
   - latest error/signals
-  - last known Ray and W&B links
+  - last known Iris and W&B links

 ### Canary Freeze Policy

@@ -274,10 +274,8 @@ gh issue list \
 Launch shape (illustrative, to pin in recipe):

 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -- python experiments/ferries/daily.py --run_name "daily-125m-$(date +%F)"
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -- python -m experiments.ferries.daily --run_name "daily-125m-$(date +%F)"
 ```

 Monitoring handoff:
@@ -344,7 +342,7 @@ Phase-2:
 ## Resolved Decisions

 1. Ferry run closure uses a log-only PR (`docs/experiments/daily-ferry-log.md`); proposal/debug details live in issues.
-2. Default cluster for now is `us-central1` (Ray CLI cluster key; maps to zone `us-central1-a`).
+2. Default cluster for now is `marin` (Iris `--cluster` key, resolves to `lib/iris/examples/marin.yaml`).
 3. "Experiment-relevant issues" filter starts with label `experiment` only.
 4. "Max 2 knobs changed" remains policy guidance, not script-enforced.
 5. Discord automation is deferred to Phase 2.
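The run-record fields touched by the first hunk of this file can be pictured as a small typed record. This is a sketch under stated assumptions: it covers only the seven fields visible in the hunk (the full list in `ferry_framework.md` may be longer), `FerryRunRecord` and `to_json` are hypothetical names, and the ISO 8601 timestamp format is assumed rather than taken from the source.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FerryRunRecord:
    """Hypothetical container for the run-record fields visible in the hunk."""

    script: str
    git_sha: str
    cluster: str
    iris_job_id: str  # replaces the former ray_job_id field
    wandb_run_id: str
    wandb_url: str
    start_time: str  # assumed to be an ISO 8601 timestamp


def to_json(record: FerryRunRecord) -> str:
    """Serialize a record with stable key order so it diffs cleanly in issues."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A record built this way can be pasted into the ferry issue as JSON. Every value supplied when constructing one, including the job id and W&B URL, comes from the actual launch and is not defined by this sketch.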

.agents/skills/architecture/SKILL.md

Lines changed: 7 additions & 7 deletions
@@ -5,15 +5,15 @@ description: Marin architecture overview and repository structure reference. Use

 # Skill: Marin Architecture

-Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using Ray for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.
+Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using [Fray](https://github.com/marin-community/marin/tree/main/lib/fray) (dispatched onto [Iris](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) on shared clusters) for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.

 ## Core Architecture

 **Executor Pattern**: Experiments are DAGs of `ExecutorStep` objects (`lib/marin/src/marin/execution/executor.py`). Output path = `<base>/<name>-<hash>` where hash covers versioned fields and dependencies. Only changed steps re-run.

-**Ray Distribution**: Steps can be normal or `@ray.remote` functions. Ray ships code to workers with step-specific dependency groups from `pyproject.toml`.
+**Fray/Iris Distribution**: Steps that need remote execution wrap their function with `remote()` (see `experiments/defaults.py`). Fray launches each remote step as a sub-job against the current cluster (Iris on shared infra, Local for laptop runs). Step-specific dependency groups are drawn from `pyproject.toml`.

-**Entry Point**: `executor_main()` or [`lib/marin/src/marin/run/ray_run.py`](https://github.com/marin-community/marin/blob/main/lib/marin/src/marin/run/ray_run.py) for cluster execution.
+**Entry Point**: Call `executor_main()` at the bottom of the script; launch the script itself as a CPU-only Iris job (`uv run iris --cluster=marin job run -- python -m experiments.<script>`) for cluster execution. See [`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) for the full launch reference.

 ## Repository Structure

@@ -24,7 +24,7 @@ marin/

 ├── lib/marin/src/marin/ # Core library organized by function
 │   ├── execution/ # DAG executor (executor.py, status_actor.py)
-│   ├── run/ # Job launchers (ray_run.py, slurm_run.py)
+│   ├── run/ # Legacy launcher stubs (slurm_run.py); submit via `iris job run` on shared clusters
 │   ├── download/ # Dataset downloaders (huggingface/, ar5iv/, wikipedia/, nemotron_cc/, filesystem/)
 │   ├── transform/ # Raw data → text (ar5iv/, stackexchange/, wikipedia/, conversation/, domain-specific)
 │   ├── crawl/ # Web crawling (fetch_links.py, minhash/, fineweb_edu/, open_web_math/)
@@ -55,8 +55,8 @@ marin/
 │   └── quickstart-data/

 ├── docs/ # Documentation (tutorials/, explanations/, references/, recipes/, reports/, design/, dev-guide/, model-cards/)
-├── infra/ # Ray cluster configs (marin-*.yaml, configure_gcp_registry.py)
-├── scripts/ # Utilities (ray/, training/, pm/, debug/, gpu_eval/)
+├── infra/ # Cluster configs (configure_gcp_registry.py, configure_temp_buckets.py). Iris cluster configs live under lib/iris/examples/.
+├── scripts/ # Utilities (iris/, training/, pm/, debug/, gpu_eval/)
 └── docker/ # Docker configs (marin/, levanter/)
 ```

@@ -72,7 +72,7 @@ marin/
 5. **Train** (`lib/marin/src/marin/training/`): Levanter (JAX) on TPU/GPU
 6. **Evaluate** (`lib/marin/src/marin/evaluation/`): lm-eval-harness or vLLM

-**Cluster Infrastructure** (`infra/README.md`): Ray on GCP, on-demand head + preemptible TPU workers (v4/v5e/v6e), autoscaling 4-1024 workers, managed via `scripts/ray/cluster.py`
+**Cluster Infrastructure** ([`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md)): Iris on GCP (TPU v4/v5e/v6e) and CoreWeave (H100 GPUs); on-demand controller + autoscaling preemptible workers. Submit jobs with `uv run iris --cluster=marin job run ...`.

 **Default Helpers** (`experiments/defaults.py`): `default_download()`, `default_tokenize()`, `default_train()`, `default_eval()`

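The Executor Pattern line in this skill file says the output path is `<base>/<name>-<hash>`, with the hash covering versioned fields and dependencies, so only changed steps re-run. The sketch below illustrates that idea generically; it is not Marin's hashing code. The function name `step_output_path`, the use of SHA-256 over canonical JSON, the 8-character digest, and the `gs://example-bucket` prefix are all assumptions made for illustration.

```python
import hashlib
import json


def step_output_path(base: str, name: str, versioned_fields: dict, dependency_paths: list[str]) -> str:
    """Derive an output path whose suffix changes when config or upstream outputs change.

    Illustrative only: the real executor may hash different inputs in a different way.
    """
    payload = json.dumps(
        {"fields": versioned_fields, "deps": sorted(dependency_paths)},
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{base}/{name}-{digest}"


# Hypothetical two-step pipeline: tokenization feeds training.
tokenized = step_output_path("gs://example-bucket", "tokenize", {"tokenizer": "llama3"}, [])
trained = step_output_path("gs://example-bucket", "train", {"lr": 1e-3}, [tokenized])
```

Because the training path is derived from the tokenization path, changing the tokenizer yields a new suffix for both steps. Changing only the learning rate leaves the tokenization path, and therefore its cached output, untouched.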

.agents/skills/ferries/SKILL.md

Lines changed: 9 additions & 13 deletions
@@ -39,7 +39,7 @@ Collect:
 1. Last ferry references:
    - issue URL
    - PR/commit URL
-   - W&B run URL and Ray job ID
+   - W&B run URL and Iris job ID
 2. Human objective for this interval:
    - standard integration pass
    - or explicit regression investigation
@@ -124,25 +124,21 @@ Then push the launch commit (no proposal PR by default).
 Before launch, confirm requester approval in-thread unless they already gave explicit "launch without asking" permission.

 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -- python -m experiments.ferries.daily
 ```

 After launch, capture and post to the issue:
-- Ray job id
+- Iris job id (printed by `iris job run`, form `/<user>/iris-run-job-YYYYMMDD-HHMMSS`)
 - cluster
 - launch timestamp
 - W&B link(s) when available

 Optional deterministic daily rerun name:
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -e FERRY_DATE="$(date +%Y%m%d-%H%M%S)-daily-ferry" \
-  -- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -e FERRY_DATE "$(date +%Y%m%d-%H%M%S)-daily-ferry" \
+  -- python -m experiments.ferries.daily
 ```

 #### 5) Monitor to terminal state
@@ -158,7 +154,7 @@ Follow the **babysit-job** skill with:
 Post in the ferry issue:
 - final status,
 - key metrics/regressions,
-- Ray job ID and W&B link(s),
+- Iris job ID and W&B link(s),
 - recommendation for next ferry.
 - Optional: post a manual Discord update for major run state changes.

@@ -174,7 +170,7 @@ Required terminal issue comment template:

 ```markdown
 Final status: <SUCCEEDED|FAILED|STOPPED>
-Ray job id: <job_id>
+Iris job id: <job_id>
 W&B link: <url>
 Final eval summary: <short summary + key metrics>
 Experiment link: <experiment JSON/browser link>
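The job id form and the terminal comment template in this skill file lend themselves to a small validation helper. The sketch below is illustrative and rests on assumptions: the regex encodes only the `/<user>/iris-run-job-YYYYMMDD-HHMMSS` form quoted in the diff (explicitly named jobs may look different), the characters allowed in `<user>` are a guess, the template covers only the five lines visible in the hunk, and `parse_job_id` and `render_terminal_comment` are hypothetical names.

```python
import re

# Encodes the form quoted in the skill text; the <user> character set is an assumption.
JOB_ID_RE = re.compile(
    r"^/(?P<user>[^/\s]+)/iris-run-job-(?P<date>\d{8})-(?P<time>\d{6})$"
)

VALID_STATUSES = ("SUCCEEDED", "FAILED", "STOPPED")


def parse_job_id(job_id: str) -> dict[str, str]:
    """Split a default-form Iris job id into user, date, and time parts."""
    match = JOB_ID_RE.match(job_id.strip())
    if match is None:
        raise ValueError(f"not a default-form Iris job id: {job_id!r}")
    return match.groupdict()


def render_terminal_comment(
    status: str, job_id: str, wandb_url: str, eval_summary: str, experiment_link: str
) -> str:
    """Fill in the template lines visible in the hunk, validating status and job id."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    parse_job_id(job_id)  # raises if the id does not match the expected form
    return "\n".join(
        [
            f"Final status: {status}",
            f"Iris job id: {job_id}",
            f"W&B link: {wandb_url}",
            f"Final eval summary: {eval_summary}",
            f"Experiment link: {experiment_link}",
        ]
    )
```

Validating the id before posting catches the case where a stale Ray job id is pasted into the new `Iris job id` field, since Ray ids would not match this form.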
