grug: dispatch through fray jobs (to fix multinode)#3269
Claude finished @dlwh's task in 1m 20s
Review
This PR moves TPU resource allocation for grug
No specification found -- the PR is under 500 lines, so none is expected.
Findings: None. The implementation is clean and correct:
LGTM.
@rjpower how do you feel about the special-casing of resources? We need it for Iris.
rjpower
left a comment
Seems good to me; the "magic" remote wrapping is maybe too Ray-like in general.
Fixing multi-node is great! I'm not sure about feasibility, but long term it would be nice if our launch/train/model files were completely agnostic to the execution framework and treated 'uv run launch.py' as the default execution path. Then anyone who wants a Marin-specific execution framework could opt in purely through command-line args, like 'execution-framework -tpu4-8 -region -grug.my_impl.launch', so the framework's complexity is pulled out into the command line. Anyone who needs complex DAG relationships can bring that complexity back into the code, but only advanced users would have to worry about it. I am comfortable with the code as written for myself, but I think run_grug_local and the dispatch args will be a head-scratcher for anyone new to the repo.
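A minimal sketch of the command-line-only idea described in the comment above, in case it helps make the proposal concrete. Everything here is an assumption for illustration: the `--tpu` and `--region` flag spellings, the `module[:function]` entrypoint syntax, and the helper names `resolve_entrypoint` and `plan` do not come from Marin, Fray, or Iris.

```python
import argparse
import importlib
from typing import Callable, Optional, Sequence


def resolve_entrypoint(spec: str) -> Callable:
    """Load `module[:function]`; the function name defaults to `main` (hypothetical convention)."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr or "main")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags: all execution concerns live here, not in the launch file.
    parser = argparse.ArgumentParser(prog="execution-framework")
    parser.add_argument("entrypoint", help="module[:function] to run")
    parser.add_argument("--tpu", default=None, help="accelerator type, e.g. v4-8")
    parser.add_argument("--region", default=None, help="where to schedule the job")
    return parser


def plan(argv: Optional[Sequence[str]] = None) -> dict:
    """Decide how to run the entrypoint without importing or executing it."""
    args = build_parser().parse_args(argv)
    # With no accelerator requested, the launch file runs exactly as it would under `uv run`.
    mode = "remote" if args.tpu else "local"
    return {
        "mode": mode,
        "entrypoint": args.entrypoint,
        "tpu": args.tpu,
        "region": args.region,
    }
```

Under this sketch the launch module only needs to expose a plain callable, and the wrapper decides whether to call it in-process or hand it to a scheduler.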
I get the desire to simplify, but it'd add a lot of complexity to support both Marin and non-Marin execution, since the whole download->tokenize->train pipeline depends on the DAG executor.
## Summary
- route merged Grug variants (`base`, `moe`) through Fray job dispatch in `run_grug(...)` instead of `ExecutorStep` resource dispatch
- add shared helper `experiments/grug/dispatch.py` and rename it to `dispatch_grug_training_run(...)`
- make `GrugRunConfig` carry `ResourceConfig`, so dispatch is resource-driven instead of TPU-variant hardcoding in `run_grug`

## Changes
- add `dispatch_grug_training_run(...)` helper to build/submit `JobRequest` and wait on completion
- helper infers environment extras from resource device (`tpu` / `gpu`) so Iris and Ray both get the right extras
- in `experiments/grug/base/train.py` and `experiments/grug/moe/train.py`:
  - rename existing train body to `_run_grug_local(...)`
  - add dispatcher `run_grug(...)` that submits `_run_grug_local` via Fray
  - extend `GrugRunConfig` with `resources: ResourceConfig`
- in `experiments/grug/base/launch.py` and `experiments/grug/moe/launch.py`:
  - remove `remote(..., resources=...)` usage
  - use plain `fn=run_grug_*_trial`
  - pass `resources` through launch config into `GrugRunConfig`

## Validation
- `./infra/pre-commit.py experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py`
- `uv run python -m py_compile experiments/grug/dispatch.py experiments/grug/base/train.py experiments/grug/moe/train.py experiments/grug/base/launch.py experiments/grug/moe/launch.py`
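
The dispatch pattern described above can be sketched roughly as follows. This is not the code from the PR and does not use the real Fray API: `ResourceConfig`, `GrugRunConfig`, and `JobRequest` here are simplified stand-in dataclasses, and `infer_extras`, `run_in_process`, and the `submit` parameter are hypothetical names introduced only for illustration. The toy backend runs the job in-process, which stands in for "submit and wait on completion".

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class ResourceConfig:
    """Stand-in: only the device kind matters for this sketch."""
    device: str  # "tpu", "gpu", or "cpu"


@dataclass(frozen=True)
class GrugRunConfig:
    """Stand-in for the run config, which now carries its own resources."""
    run_id: str
    resources: ResourceConfig


@dataclass(frozen=True)
class JobRequest:
    """Stand-in for the request handed to the job backend."""
    name: str
    entrypoint: Callable[..., Any]
    args: Tuple[Any, ...]
    resources: ResourceConfig
    extras: Tuple[str, ...]


def infer_extras(resources: ResourceConfig) -> Tuple[str, ...]:
    # Accelerator jobs get the matching dependency extra; other devices get none.
    return (resources.device,) if resources.device in ("tpu", "gpu") else ()


def run_in_process(request: JobRequest) -> Any:
    # Toy backend: runs the entrypoint immediately instead of scheduling it.
    return request.entrypoint(*request.args)


def dispatch_grug_training_run(
    config: GrugRunConfig,
    local_fn: Callable[[GrugRunConfig], Any],
    submit: Callable[[JobRequest], Any] = run_in_process,
) -> Any:
    # Build the request from the config's resources, then hand it to the backend.
    request = JobRequest(
        name=f"grug-{config.run_id}",
        entrypoint=local_fn,
        args=(config,),
        resources=config.resources,
        extras=infer_extras(config.resources),
    )
    return submit(request)


def _run_grug_local(config: GrugRunConfig) -> str:
    # Placeholder for the real training body.
    return f"trained {config.run_id} on {config.resources.device}"


def run_grug(config: GrugRunConfig, submit: Callable[[JobRequest], Any] = run_in_process) -> Any:
    # Public entry point: the launch file passes this as a plain function.
    return dispatch_grug_training_run(config, _run_grug_local, submit)
```

In this sketch the launch file never mentions resources when wrapping the function; they travel inside the config, which is what lets the dispatcher choose extras per device.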