grug: dispatch through fray jobs (to fix multinode) by dlwh · Pull Request #3269 · marin-community/marin

dlwh · 2026-03-05T06:41:55Z

Summary

route merged Grug variants (base, moe) through Fray job dispatch in run_grug(...) instead of ExecutorStep resource dispatch
add shared helper experiments/grug/dispatch.py and rename it to dispatch_grug_training_run(...)
make GrugRunConfig carry ResourceConfig, so dispatch is resource-driven instead of TPU-variant hardcoding in run_grug

Changes

add dispatch_grug_training_run(...) helper to build/submit JobRequest and wait on completion
helper infers environment extras from resource device (tpu / gpu) so Iris and Ray both get the right extras
in experiments/grug/base/train.py and experiments/grug/moe/train.py:
- rename existing train body to _run_grug_local(...)
- add dispatcher run_grug(...) that submits _run_grug_local via Fray
- extend GrugRunConfig with resources: ResourceConfig
in experiments/grug/base/launch.py and experiments/grug/moe/launch.py:
- remove remote(..., resources=...) usage
- use plain fn=run_grug_*_trial
- pass resources through launch config into GrugRunConfig

Validation

./infra/pre-commit.py experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py
uv run python -m py_compile experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py

claude · 2026-03-05T06:42:13Z

Claude finished @dlwh's task in 1m 20s —— View job

Review

Read changed files
Analyze correctness and behavioral regressions
Post review

This PR moves TPU resource allocation for grug base and moe variants from ExecutorStep-level remote(fn, resources=...) wrapping to explicit Fray job submission inside run_grug(), with a shared helper in experiments/grug/dispatch.py. The change achieves its stated objectives.

No specification found -- the PR is under 500 lines, so none is expected.

Findings: None. The implementation is clean and correct:

dispatch_grug_tpu correctly builds a JobRequest, submits via current_client(), and blocks with raise_on_failure=True.
Both launch files correctly drop the remote() wrapper and unused imports (ResourceConfig, remote).
TPU variant (v5p-8) is preserved in both train modules, matching prior behavior.
run_grug validates trainer.id before dispatch, and trainer.id is set from config.run_id in the TrainerConfig constructor in both launch files, so the guard is satisfied.
The executor node now runs the launch function directly and blocks on the Fray job, which is the intended architectural change.

LGTM.

dlwh · 2026-03-05T06:59:17Z

@rjpower how do you feel about the special casing of resources. we need it for iris

rjpower

Seems good to me, the "magic" remote wrapping is maybe too ray like in general.

ClassicLarry · 2026-03-05T17:50:04Z

Fixing multi-node is great! I am not sure on feasibility, but long term it would be nice if our launch/train/model files were completely agnostic to the execution framework, and treated 'uv run launch.py' as the default execution path. Then if someone wants to use a Marin-specific execution framework, they can do so through only command line args like 'execution-framework -tpu4-8 -region -grug.my_impl.launch.

So the complexity of the execution framework is all pulled away into the command line arg. If someone wants to do complex DAG relationships, then they can bring back in the complexity of the execution framework into the code to handle that, but that is treated as something only advanced users have to worry about.

I am comfortable with the code as written for myself, but I think run_grug_local and dispatch args will be a head-scratcher for anyone new to the repo.

dlwh · 2026-03-05T19:03:57Z

I get the desire to simplify, but it'd add a lot of complexity to support both marin and not marin since the whole download->tokenize->train pipeline depends on the dag executor.

## Summary - route merged Grug variants (`base`, `moe`) through Fray job dispatch in `run_grug(...)` instead of `ExecutorStep` resource dispatch - add shared helper `experiments/grug/dispatch.py` and rename it to `dispatch_grug_training_run(...)` - make `GrugRunConfig` carry `ResourceConfig`, so dispatch is resource-driven instead of TPU-variant hardcoding in `run_grug` ## Changes - add `dispatch_grug_training_run(...)` helper to build/submit `JobRequest` and wait on completion - helper infers environment extras from resource device (`tpu` / `gpu`) so Iris and Ray both get the right extras - in `experiments/grug/base/train.py` and `experiments/grug/moe/train.py`: - rename existing train body to `_run_grug_local(...)` - add dispatcher `run_grug(...)` that submits `_run_grug_local` via Fray - extend `GrugRunConfig` with `resources: ResourceConfig` - in `experiments/grug/base/launch.py` and `experiments/grug/moe/launch.py`: - remove `remote(..., resources=...)` usage - use plain `fn=run_grug_*_trial` - pass `resources` through launch config into `GrugRunConfig` ## Validation - `./infra/pre-commit.py experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py` - `uv run python -m py_compile experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py`

grug: dispatch merged variants through fray jobs

90d7bed

dlwh added the agent-generated Created by automation/agent label Mar 5, 2026

grug: drive dispatch from GrugRunConfig resources

cbe7679

dlwh requested review from ClassicLarry and rjpower March 5, 2026 06:58

dlwh changed the title ~~grug: dispatch merged variants through fray jobs~~ grug: dispatch through fray jobs (to fix multinode) Mar 5, 2026

dlwh added 3 commits March 4, 2026 23:04

grug: bundle parallel-attn variant with fray dispatch

b1afbd3

grug: add resources to remaining launch/run config callsites

c04bae1

grug: remove parallel-attn variant from framework PR

97491f4

rjpower approved these changes Mar 5, 2026

View reviewed changes

rjpower merged commit d569a71 into main Mar 5, 2026
15 checks passed

rjpower deleted the codex/grug-fray-dispatch-main-variants branch March 5, 2026 15:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grug: dispatch through fray jobs (to fix multinode)#3269

grug: dispatch through fray jobs (to fix multinode)#3269
rjpower merged 5 commits intomainfrom
codex/grug-fray-dispatch-main-variants

dlwh commented Mar 5, 2026 •

edited

Loading

Uh oh!

claude Bot commented Mar 5, 2026 •

edited

Loading

Uh oh!

dlwh commented Mar 5, 2026

Uh oh!

rjpower left a comment

Uh oh!

Uh oh!

ClassicLarry commented Mar 5, 2026

Uh oh!

dlwh commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dlwh commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Validation

Uh oh!

claude Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Uh oh!

dlwh commented Mar 5, 2026

Uh oh!

rjpower left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ClassicLarry commented Mar 5, 2026

Uh oh!

dlwh commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dlwh commented Mar 5, 2026 •

edited

Loading

claude Bot commented Mar 5, 2026 •

edited

Loading