Skip to content

grug: dispatch through fray jobs (to fix multinode)#3269

Merged
rjpower merged 5 commits intomainfrom
codex/grug-fray-dispatch-main-variants
Mar 5, 2026
Merged

grug: dispatch through fray jobs (to fix multinode)#3269
rjpower merged 5 commits intomainfrom
codex/grug-fray-dispatch-main-variants

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Mar 5, 2026

Summary

  • route merged Grug variants (base, moe) through Fray job dispatch in run_grug(...) instead of ExecutorStep resource dispatch
  • add shared helper experiments/grug/dispatch.py and rename it to dispatch_grug_training_run(...)
  • make GrugRunConfig carry ResourceConfig, so dispatch is resource-driven instead of TPU-variant hardcoding in run_grug

Changes

  • add dispatch_grug_training_run(...) helper to build/submit JobRequest and wait on completion
  • helper infers environment extras from resource device (tpu / gpu) so Iris and Ray both get the right extras
  • in experiments/grug/base/train.py and experiments/grug/moe/train.py:
    • rename existing train body to _run_grug_local(...)
    • add dispatcher run_grug(...) that submits _run_grug_local via Fray
    • extend GrugRunConfig with resources: ResourceConfig
  • in experiments/grug/base/launch.py and experiments/grug/moe/launch.py:
    • remove remote(..., resources=...) usage
    • use plain fn=run_grug_*_trial
    • pass resources through launch config into GrugRunConfig

Validation

  • ./infra/pre-commit.py experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py
  • uv run python -m py_compile experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py

@dlwh dlwh added the agent-generated Created by automation/agent label Mar 5, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 5, 2026

Claude finished @dlwh's task in 1m 20s —— View job


Review

  • Read changed files
  • Analyze correctness and behavioral regressions
  • Post review

This PR moves TPU resource allocation for grug base and moe variants from ExecutorStep-level remote(fn, resources=...) wrapping to explicit Fray job submission inside run_grug(), with a shared helper in experiments/grug/dispatch.py. The change achieves its stated objectives.

No specification found -- the PR is under 500 lines, so none is expected.

Findings: None. The implementation is clean and correct:

  • dispatch_grug_tpu correctly builds a JobRequest, submits via current_client(), and blocks with raise_on_failure=True.
  • Both launch files correctly drop the remote() wrapper and unused imports (ResourceConfig, remote).
  • TPU variant (v5p-8) is preserved in both train modules, matching prior behavior.
  • run_grug validates trainer.id before dispatch, and trainer.id is set from config.run_id in the TrainerConfig constructor in both launch files, so the guard is satisfied.
  • The executor node now runs the launch function directly and blocks on the Fray job, which is the intended architectural change.

LGTM.

@dlwh dlwh requested review from ClassicLarry and rjpower March 5, 2026 06:58
@dlwh dlwh changed the title grug: dispatch merged variants through fray jobs grug: dispatch through fray jobs (to fix multinode) Mar 5, 2026
@dlwh
Copy link
Copy Markdown
Member Author

dlwh commented Mar 5, 2026

@rjpower how do you feel about the special casing of resources. we need it for iris

Copy link
Copy Markdown
Collaborator

@rjpower rjpower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me, the "magic" remote wrapping is maybe too ray like in general.

@rjpower rjpower merged commit d569a71 into main Mar 5, 2026
15 checks passed
@rjpower rjpower deleted the codex/grug-fray-dispatch-main-variants branch March 5, 2026 15:09
@ClassicLarry
Copy link
Copy Markdown
Contributor

Fixing multi-node is great! I am not sure on feasibility, but long term it would be nice if our launch/train/model files were completely agnostic to the execution framework, and treated 'uv run launch.py' as the default execution path. Then if someone wants to use a Marin-specific execution framework, they can do so through only command line args like 'execution-framework -tpu4-8 -region -grug.my_impl.launch.

So the complexity of the execution framework is all pulled away into the command line arg. If someone wants to do complex DAG relationships, then they can bring back in the complexity of the execution framework into the code to handle that, but that is treated as something only advanced users have to worry about.

I am comfortable with the code as written for myself, but I think run_grug_local and dispatch args will be a head-scratcher for anyone new to the repo.

@dlwh
Copy link
Copy Markdown
Member Author

dlwh commented Mar 5, 2026

I get the desire to simplify, but it'd add a lot of complexity to support both marin and not marin since the whole download->tokenize->train pipeline depends on the dag executor.

ruili33 pushed a commit that referenced this pull request Mar 25, 2026
## Summary
- route merged Grug variants (`base`, `moe`) through Fray job dispatch
in `run_grug(...)` instead of `ExecutorStep` resource dispatch
- add shared helper `experiments/grug/dispatch.py` and rename it to
`dispatch_grug_training_run(...)`
- make `GrugRunConfig` carry `ResourceConfig`, so dispatch is
resource-driven instead of TPU-variant hardcoding in `run_grug`

## Changes
- add `dispatch_grug_training_run(...)` helper to build/submit
`JobRequest` and wait on completion
- helper infers environment extras from resource device (`tpu` / `gpu`)
so Iris and Ray both get the right extras
- in `experiments/grug/base/train.py` and
`experiments/grug/moe/train.py`:
  - rename existing train body to `_run_grug_local(...)`
- add dispatcher `run_grug(...)` that submits `_run_grug_local` via Fray
  - extend `GrugRunConfig` with `resources: ResourceConfig`
- in `experiments/grug/base/launch.py` and
`experiments/grug/moe/launch.py`:
  - remove `remote(..., resources=...)` usage
  - use plain `fn=run_grug_*_trial`
  - pass `resources` through launch config into `GrugRunConfig`

## Validation
- `./infra/pre-commit.py experiments/grug/dispatch.py
experiments/grug/base/train.py experiments/grug/moe/train.py
experiments/grug/base/launch.py experiments/grug/moe/launch.py`
- `uv run python -m py_compile experiments/grug/dispatch.py
experiments/grug/base/train.py experiments/grug/moe/train.py
experiments/grug/base/launch.py experiments/grug/moe/launch.py`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants