[trainer] feat: Add Nemo-Automodel as alternative training engine #5407
HuiyingLi wants to merge 19 commits into verl-project:main
Conversation
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Code Review
This pull request introduces a new automodel SFT backend, which leverages nemo_automodel for distributed training. The changes include adding the engine implementation, configuration files, and test scripts. I've identified a configuration issue in the test script and a maintainability concern in the engine implementation. Overall, this is a significant feature addition.
```python
if isinstance(output, torch.Tensor):
    from types import SimpleNamespace
    output = SimpleNamespace(logits=output)
```
The model's output is conditionally wrapped in a SimpleNamespace if it's a raw tensor. This suggests an inconsistent return type from self.module, which can make the code harder to maintain and reason about. It would be more robust to enforce a consistent, structured return type (like CausalLMOutput) from the model to avoid such conditional handling.
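One way to act on this suggestion is to normalize the return value in a single helper rather than scattering conditional wrapping through the engine. The sketch below is a hypothetical refactor, not code from this PR: the helper name `as_structured_output` is an assumption, and a plain list stands in for a logits tensor so the example stays self-contained.

```python
from types import SimpleNamespace

def as_structured_output(output):
    """Hypothetical normalizer (not from this PR): ensure callers can always
    read `output.logits`, whether the model returned a raw logits tensor or
    an already-structured output object such as CausalLMOutput."""
    if hasattr(output, "logits"):
        return output  # already structured; pass through unchanged
    return SimpleNamespace(logits=output)

# Both shapes of return value normalize to the same interface:
raw = [[0.1, 0.9]]  # stands in for a raw logits tensor
print(as_structured_output(raw).logits is raw)  # raw value gets wrapped
```

Centralizing the check this way keeps the engine's forward path free of type-dependent branches, which is the consistency the review comment asks for.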
hi @HuiyingLi , thanks for your great contribution.
Hi @ISEEKYAN ,
Great. Is your experiment on H100? If so, the MFU looks good, but it would be better to have a fair comparison with FSDP or Megatron. This is not a blocker for merging the PR; it would just be a useful reference for users adopting AutoModel. It would also help to add a doc showing the comparison, along with an example so users can easily get hands-on.
@HuiyingLi Thanks for your great contribution! Could you please sign the CLA~
**Requirements**

- Automodel r0.3.0
Maybe in another PR we should refactor docs/start/install.rst to cover the install methods for all model engines and rollout engines, and use some display elements to make the choices clearer.
We should also prepare some CI tests for Nemo-Automodel.
What does this PR do?
Add NeMo-Automodel as a training engine. The SFT trainer is tested with Qwen2.5-0.5B.
Relevant PRs:
Checklist Before Starting
- PR title follows [{modules}] {type}: {description} (this will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API, add [BREAKING] to the beginning of the title, like [BREAKING][fsdp, megatron] feat: dynamic batching

Test
- automodel backend 1gpu & 4gpu tp1/tp2 against FSDP backend 1gpu, rmpad true/false (loss curves in attached screenshot)
- automodel backend finetuning Moonlight 16B with ep8 on 8xH100 (screenshot)
- automodel backend finetuning Qwen3 30B with ep8 on 8xH100 (screenshot)
- automodel backend finetuning Qwen2.5-7B with fsdp2 on 4xH100 (screenshot)
API and Usage Example
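The template's usage snippet is not filled in here. Purely as an illustration, the fragment below sketches what selecting the new backend in an SFT config might look like; every key name and value in it is an assumption for illustration, not taken from this PR's actual configuration files.

```yaml
# Illustrative sketch only: key names are assumptions, not the PR's real schema.
engine:
  backend: automodel        # hypothetical switch between fsdp / megatron / automodel
  tensor_parallel_size: 2   # corresponds to the tp1/tp2 runs listed under Test
model:
  path: Qwen/Qwen2.5-0.5B   # the model the SFT trainer was tested with
```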
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always.
- Request CI through the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the recipe submodule, also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.