diff --git a/.agents/docs/fray-migration.md b/.agents/docs/fray-migration.md
Lines changed: 0 additions & 167 deletions
diff --git a/.agents/projects/ferry_framework.md b/.agents/projects/ferry_framework.md
Lines changed: 5 additions & 7 deletions
diff --git a/.agents/skills/architecture/SKILL.md b/.agents/skills/architecture/SKILL.md
Lines changed: 7 additions & 7 deletions
diff --git a/.agents/skills/ferries/SKILL.md b/.agents/skills/ferries/SKILL.md
Lines changed: 9 additions & 13 deletions
@@ -150,7 +150,7 @@ Every ferry launch/update should include a minimal run record with these fields:
 - `script`
 - `git_sha`
 - `cluster`
-- `ray_job_id`
+- `iris_job_id`
 - `wandb_run_id`
 - `wandb_url`
 - `start_time`
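Reviewer sketch: the run-record fields visible in this hunk could be captured in a small typed container so that an incomplete record is caught before it is posted. This is a minimal illustration only. `FerryRunRecord` and `missing_fields` are hypothetical names, the diff context truncates the field list so the real record may carry more fields, and the ISO 8601 timestamp format is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FerryRunRecord:
    """Hypothetical container for the run-record fields shown in the hunk above."""

    script: str
    git_sha: str
    cluster: str
    iris_job_id: str
    wandb_run_id: str
    wandb_url: str
    start_time: str  # assumed to be an ISO 8601 UTC timestamp

    def to_json(self) -> str:
        """Serialize the record with stable key order for posting or logging."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def missing_fields(record: FerryRunRecord) -> list[str]:
    """Return the names of fields left empty, in declaration order."""
    return [name for name, value in asdict(record).items() if not value]


# Placeholder values for illustration only; none of these identify a real run.
example = FerryRunRecord(
    script="experiments/ferries/daily.py",
    git_sha="0000000",
    cluster="marin",
    iris_job_id="/alice/iris-run-job-20250101-000000",
    wandb_run_id="",
    wandb_url="",
    start_time="2025-01-01T00:00:00+00:00",
)
print(missing_fields(example))
```

A launcher could refuse to post the record until `missing_fields` returns an empty list, or post it with the W&B fields marked as pending.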
@@ -179,7 +179,7 @@ Rules:
 - handoff is allowed only with an explicit replacement owner and a state handoff containing:
   - current job status
   - latest error/signals
-  - last known Ray and W&B links
+  - last known Iris and W&B links
 
 ### Canary Freeze Policy
 
@@ -274,10 +274,8 @@ gh issue list \
 Launch shape (illustrative, to pin in recipe):
 
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -- python experiments/ferries/daily.py --run_name "daily-125m-$(date +%F)"
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -- python -m experiments.ferries.daily --run_name "daily-125m-$(date +%F)"
 ```
 
 Monitoring handoff:
@@ -344,7 +342,7 @@ Phase-2:
 ## Resolved Decisions
 
 1. Ferry run closure uses a log-only PR (`docs/experiments/daily-ferry-log.md`); proposal/debug details live in issues.
-2. Default cluster for now is `us-central1` (Ray CLI cluster key; maps to zone `us-central1-a`).
+2. Default cluster for now is `marin` (Iris `--cluster` key, resolves to `lib/iris/examples/marin.yaml`).
 3. "Experiment-relevant issues" filter starts with label `experiment` only.
 4. "Max 2 knobs changed" remains policy guidance, not script-enforced.
 5. Discord automation is deferred to Phase 2.
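Reviewer sketch: the new launch shape in this file can be assembled programmatically, which makes the argument order easy to check in a test. The flags below are copied from the `+` lines of the hunk above. The helper names `build_ferry_launch_argv` and `daily_run_name` are hypothetical and not part of the repository, and the sketch only builds the argument list without executing anything.

```python
import shlex
from datetime import date


def daily_run_name(day: date) -> str:
    """Mirror the shell expression "daily-125m-$(date +%F)" (YYYY-MM-DD)."""
    return f"daily-125m-{day.isoformat()}"


def build_ferry_launch_argv(run_name: str, cluster: str = "marin") -> list[str]:
    """Build the argv for the CPU-only launcher job shown in the diff."""
    return [
        "uv", "run", "iris", f"--cluster={cluster}", "job", "run",
        "--no-wait", "--cpu=1", "--memory=4G", "--extra=cpu",
        "--",  # everything after this is the command run inside the job
        "python", "-m", "experiments.ferries.daily",
        "--run_name", run_name,
    ]


argv = build_ferry_launch_argv(daily_run_name(date(2025, 1, 2)))
# shlex.join gives a copy-pasteable shell line for a recipe or issue comment.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would submit the job. That step is left out here so the sketch stays free of side effects.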
@@ -5,15 +5,15 @@ description: Marin architecture overview and repository structure reference. Use
 
 # Skill: Marin Architecture
 
-Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using Ray for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.
+Marin is a framework for building reproducible language model training pipelines. At its core, Marin executes DAGs of steps using [Fray](https://github.com/marin-community/marin/tree/main/lib/fray) (dispatched onto [Iris](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) on shared clusters) for distributed processing, with automatic versioning based on code and configuration. Pipeline: data curation → transformation → tokenization → training → evaluation.
 
 ## Core Architecture
 
 **Executor Pattern**: Experiments are DAGs of `ExecutorStep` objects (`lib/marin/src/marin/execution/executor.py`). Output path = `<base>/<name>-<hash>` where hash covers versioned fields and dependencies. Only changed steps re-run.
 
-**Ray Distribution**: Steps can be normal or `@ray.remote` functions. Ray ships code to workers with step-specific dependency groups from `pyproject.toml`.
+**Fray/Iris Distribution**: Steps that need remote execution wrap their function with `remote()` (see `experiments/defaults.py`). Fray launches each remote step as a sub-job against the current cluster (Iris on shared infra, Local for laptop runs). Step-specific dependency groups are drawn from `pyproject.toml`.
 
-**Entry Point**: `executor_main()` or [`lib/marin/src/marin/run/ray_run.py`](https://github.com/marin-community/marin/blob/main/lib/marin/src/marin/run/ray_run.py) for cluster execution.
+**Entry Point**: Call `executor_main()` at the bottom of the script; launch the script itself as a CPU-only Iris job (`uv run iris --cluster=marin job run -- python -m experiments.<script>`) for cluster execution. See [`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md) for the full launch reference.
 
 ## Repository Structure
 
@@ -24,7 +24,7 @@ marin/
 │
 ├── lib/marin/src/marin/                  # Core library organized by function
 │   ├── execution/              # DAG executor (executor.py, status_actor.py)
-│   ├── run/                    # Job launchers (ray_run.py, slurm_run.py)
+│   ├── run/                    # Legacy launcher stubs (slurm_run.py); submit via `iris job run` on shared clusters
 │   ├── download/               # Dataset downloaders (huggingface/, ar5iv/, wikipedia/, nemotron_cc/, filesystem/)
 │   ├── transform/              # Raw data → text (ar5iv/, stackexchange/, wikipedia/, conversation/, domain-specific)
 │   ├── crawl/                  # Web crawling (fetch_links.py, minhash/, fineweb_edu/, open_web_math/)
@@ -55,8 +55,8 @@ marin/
 │   └── quickstart-data/
 │
 ├── docs/                       # Documentation (tutorials/, explanations/, references/, recipes/, reports/, design/, dev-guide/, model-cards/)
-├── infra/                      # Ray cluster configs (marin-*.yaml, configure_gcp_registry.py)
-├── scripts/                    # Utilities (ray/, training/, pm/, debug/, gpu_eval/)
+├── infra/                      # Cluster configs (configure_gcp_registry.py, configure_temp_buckets.py). Iris cluster configs live under lib/iris/examples/.
+├── scripts/                    # Utilities (iris/, training/, pm/, debug/, gpu_eval/)
 └── docker/                     # Docker configs (marin/, levanter/)
 ```
 
@@ -72,7 +72,7 @@ marin/
 5. **Train** (`lib/marin/src/marin/training/`): Levanter (JAX) on TPU/GPU
 6. **Evaluate** (`lib/marin/src/marin/evaluation/`): lm-eval-harness or vLLM
 
-**Cluster Infrastructure** (`infra/README.md`): Ray on GCP, on-demand head + preemptible TPU workers (v4/v5e/v6e), autoscaling 4-1024 workers, managed via `scripts/ray/cluster.py`
+**Cluster Infrastructure** ([`lib/iris/OPS.md`](https://github.com/marin-community/marin/blob/main/lib/iris/OPS.md)): Iris on GCP (TPU v4/v5e/v6e) and CoreWeave (H100 GPUs); on-demand controller + autoscaling preemptible workers. Submit jobs with `uv run iris --cluster=marin job run ...`.
 
 **Default Helpers** (`experiments/defaults.py`): `default_download()`, `default_tokenize()`, `default_train()`, `default_eval()`
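Reviewer sketch: the Executor Pattern paragraph earlier in this file says the output path is `<base>/<name>-<hash>`, with the hash covering versioned fields and dependencies. The snippet below illustrates that idea only. It is not the logic in `executor.py`, and the choice of SHA-256, JSON serialization, the 8-character truncation, and the function name `step_output_path` are all assumptions.

```python
import hashlib
import json


def step_output_path(
    base: str,
    name: str,
    versioned_fields: dict,
    dependency_paths: list[str],
) -> str:
    """Derive an output path whose suffix changes when config or upstream steps change."""
    payload = json.dumps(
        {"name": name, "versioned": versioned_fields, "deps": dependency_paths},
        sort_keys=True,  # stable serialization, so equal configs hash equally
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    return f"{base.rstrip('/')}/{name}-{digest}"


# Hypothetical two-step DAG: tokenize feeds train. The bucket is a placeholder.
base = "gs://example-bucket/marin"
tok_a = step_output_path(base, "tokenize", {"tokenizer": "A"}, [])
tok_b = step_output_path(base, "tokenize", {"tokenizer": "B"}, [])
train_a = step_output_path(base, "train", {"lr": 3e-4}, [tok_a])
train_b = step_output_path(base, "train", {"lr": 3e-4}, [tok_b])

# The train config is identical in both cases, but its path still differs
# because the upstream path is part of the hashed payload.
print(train_a != train_b)
```

Because an upstream path feeds into the downstream hash, a change to one versioned field invalidates that step and everything after it, while untouched steps keep their paths and can be skipped.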
 
 
@@ -39,7 +39,7 @@ Collect:
 1. Last ferry references:
 - issue URL
 - PR/commit URL
-- W&B run URL and Ray job ID
+- W&B run URL and Iris job ID
 2. Human objective for this interval:
 - standard integration pass
 - or explicit regression investigation
@@ -124,25 +124,21 @@ Then push the launch commit (no proposal PR by default).
 Before launch, confirm requester approval in-thread unless they already gave explicit "launch without asking" permission.
 
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -- python -m experiments.ferries.daily
 ```
 
 After launch, capture and post to the issue:
-- Ray job id
+- Iris job id (printed by `iris job run`, form `/<user>/iris-run-job-YYYYMMDD-HHMMSS`)
 - cluster
 - launch timestamp
 - W&B link(s) when available
 
 Optional deterministic daily rerun name:
 ```bash
-uv run lib/marin/src/marin/run/ray_run.py \
-  --no_wait \
-  --cluster us-central1 \
-  -e FERRY_DATE="$(date +%Y%m%d-%H%M%S)-daily-ferry" \
-  -- python experiments/ferries/daily.py
+uv run iris --cluster=marin job run --no-wait --cpu=1 --memory=4G --extra=cpu \
+  -e FERRY_DATE "$(date +%Y%m%d-%H%M%S)-daily-ferry" \
+  -- python -m experiments.ferries.daily
 ```
 
 #### 5) Monitor to terminal state
@@ -158,7 +154,7 @@ Follow the **babysit-job** skill with:
 Post in the ferry issue:
 - final status,
 - key metrics/regressions,
-- Ray job ID and W&B link(s),
+- Iris job ID and W&B link(s),
 - recommendation for next ferry.
 - Optional: post a manual Discord update for major run state changes.
 
@@ -174,7 +170,7 @@ Required terminal issue comment template:
 
 ```markdown
 Final status: <SUCCEEDED|FAILED|STOPPED>
-Ray job id: <job_id>
+Iris job id: <job_id>
 W&B link: <url>
 Final eval summary: <short summary + key metrics>
 Experiment link: <experiment JSON/browser link>