[Megatron] Support optimizer offload for moe when ep > 1 (#1638)
### Checklist Before Starting
- [x] Search for similar PR(s).
### What does this PR do?
This simple PR adds support for offloading Megatron-LM's
[ChainedOptimizer](https://github.com/NVIDIA/Megatron-LM/blob/75b1ca13618bded85c81fb572f58df83ba095dc9/megatron/core/optimizer/optimizer.py#L938)
in the Megatron training backend.
In Megatron-LM, a ChainedOptimizer is used when expert parallelism is
enabled (`expert_model_parallel_size > 1`, related to #1467), which is
common for Mixture-of-Experts (MoE) models.
This has been tested and validated with the Qwen3-235B-A22B model
configuration.
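For illustration, the sketch below shows the general shape of ChainedOptimizer-aware offloading: recurse into each chained sub-optimizer, then move the underlying torch optimizer's state tensors to CPU. This is a hypothetical sketch, not verl's actual implementation; only `ChainedOptimizer` and its `chained_optimizers` attribute come from Megatron-LM, and the helper name `offload_optimizer_to_cpu` is made up here.
```python
# Minimal sketch only; the helper name and structure are hypothetical
# and do not reflect verl's actual offload implementation.
import torch
from megatron.core.optimizer.optimizer import ChainedOptimizer

def offload_optimizer_to_cpu(optimizer):
    """Recursively move optimizer state tensors to CPU to free GPU memory."""
    if isinstance(optimizer, ChainedOptimizer):
        # With expert parallelism (EP > 1), Megatron chains one optimizer per
        # parameter partition, so each sub-optimizer is offloaded in turn.
        for sub_optimizer in optimizer.chained_optimizers:
            offload_optimizer_to_cpu(sub_optimizer)
        return
    # Megatron optimizer wrappers keep the underlying torch optimizer at
    # `.optimizer`; fall back to the object itself for plain optimizers.
    torch_optimizer = getattr(optimizer, "optimizer", optimizer)
    for param_state in torch_optimizer.state.values():
        for key, value in param_state.items():
            if isinstance(value, torch.Tensor):
                param_state[key] = value.to("cpu", non_blocking=True)
    torch.cuda.synchronize()
```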
### High-Level Design
> Demonstrate the high-level design if this PR is complex.
### Specific Changes
> List the specific changes.
### API
> Demonstrate how the API changes if any.
### Usage Example
> Provide usage example(s) for easier usage.
```bash
...
actor_rollout_ref.actor.megatron.optimizer_offload=True \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=16 \
...
```
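With `expert_model_parallel_size > 1`, Megatron constructs a ChainedOptimizer over the per-partition optimizers, so `optimizer_offload=True` now works for MoE training as well.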
### Test
> For changes that cannot be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### Additional Info.
- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Megatron]
- **Inference**: [none]
### Checklist Before Submitting
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
---------
Co-authored-by: charlie.cs <[email protected]>
Co-authored-by: ETOgaosion <[email protected]>