-
Notifications
You must be signed in to change notification settings - Fork 2.9k
[fsdp, fully-async] feat: add validate process on trainer node when use_trainer_do_validate=True … #4683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces the functionality to perform validation on the trainer node, controlled by the use_trainer_do_validate flag. The changes span across configuration, worker logic, data handling, and synchronization. Key modifications include adding the use_trainer_do_validate flag, conditionally assigning the ActorRollout role to the trainer, splitting the validation dataset, and updating the trainer to merge validation metrics. The overall approach is sound, but I've identified a few critical issues related to robustness and potential runtime errors that should be addressed. Specifically, there are concerns with how unique actor names are generated, how validation results are merged, and how datasets are split. These are detailed in the comments.
b1f602e to
4c116d2
Compare
Co-authored-by: Shangwei-Li <[email protected]>
4c116d2 to
5b5d5cd
Compare
What does this PR do?
User Trainer node to do validate process when run mode on fully-async, It can save time for validate computing and reduce perf/time_of_step peak
dapo_7b_math_fsdp2_8_8.sh, it can improve about 1X speedprocess_validation_metrics()on_validate()process, when input datasets len=1444, it latency reduce from 150+s to 40+sChecklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
dapo_7b_math_fsdp2_8_8.shaddasync_training.use_trainer_do_validate=Truecommand to do computedapo_7b_math_fsdp2_8_8.shdapo_7b_math_fsdp2_8_8.sh+async_training.use_trainer_do_validate=Trueasync_training.use_trainer_do_validate=True)async_training.use_trainer_do_validate=False)API and Usage Example
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)