Add configs and script to launch GRPO jobs to AWS cluster via Slurm #598
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage diff, main vs. #598:
Coverage: 83.05% -> 82.69% (-0.37%)
Files: 31 -> 31
Lines: 4036 -> 3946 (-90)
Hits: 3352 -> 3263 (-89)
Misses: 684 -> 683 (-1)
allenwang28
left a comment
lgtm, shall we remove the existing apps/grpo/qwen3_32b.yaml?
Yes, thanks, will do in a separate PR!
Description
Add three configs for the FAIR AWS cluster, and a script to launch jobs.
Also make certain variables in the SlurmLauncher configurable.
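As an illustration of what configurable launcher variables can look like, here is a minimal sketch. It is not the SlurmLauncher code from this PR: the class `SlurmOptions`, its field names, and its defaults are all hypothetical. Only the `sbatch` flags it emits (`--job-name`, `--nodes`, `--gpus-per-node`, `--cpus-per-task`, `--time`, `--account`, `--qos`) are standard Slurm options.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SlurmOptions:
    """Hypothetical settings container; NOT the actual SlurmLauncher API."""

    job_name: str = "grpo"
    nodes: int = 1
    gpus_per_node: int = 8
    cpus_per_task: int = 16
    time_limit: str = "01:00:00"
    # Cluster-specific values stay unset unless the config provides them.
    account: Optional[str] = None
    qos: Optional[str] = None
    extra_args: List[str] = field(default_factory=list)

    def to_sbatch_args(self) -> List[str]:
        """Render the options as sbatch command-line flags."""
        args = [
            f"--job-name={self.job_name}",
            f"--nodes={self.nodes}",
            f"--gpus-per-node={self.gpus_per_node}",
            f"--cpus-per-task={self.cpus_per_task}",
            f"--time={self.time_limit}",
        ]
        # Optional flags are emitted only when explicitly configured.
        if self.account is not None:
            args.append(f"--account={self.account}")
        if self.qos is not None:
            args.append(f"--qos={self.qos}")
        return args + list(self.extra_args)
```

With this shape, a per-cluster config only overrides the fields it needs, for example `SlurmOptions(nodes=4, account="my_account").to_sbatch_args()`, and everything else falls back to the defaults.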
Test Plan
Log on to the AWS login node and git clone torchforge:
Then set up your conda environment:
Run GRPO on Qwen3 8B model
FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2362697/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/yo82qdaq
Run GRPO on Qwen3 32B model
FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2371720/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/ryy2u75n
Run GRPO on Qwen3 30B A3B model
FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2372099/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/qmuo8whb