@daniellepintz daniellepintz commented Nov 20, 2025

Description

Add three configs for the FAIR AWS cluster, and a script to launch jobs.
Also make certain variables in the SlurmLauncher configurable.
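
For readers unfamiliar with the pattern, here is a minimal sketch of what "configurable launcher variables" could look like. It is not the code in this PR: the class name `SlurmOptions`, its fields, and its defaults are all hypothetical, chosen only for illustration. The only real names are the standard `sbatch` flags it emits.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Mapping, Optional


@dataclass(frozen=True)
class SlurmOptions:
    """Hypothetical bundle of launcher settings; not the real SlurmLauncher."""

    # Illustrative defaults only; the actual defaults in this PR are not shown here.
    partition: str = "default"
    account: Optional[str] = None
    qos: Optional[str] = None
    cpus_per_task: int = 8

    @classmethod
    def from_config(cls, cfg: Mapping[str, Any]) -> "SlurmOptions":
        # Reject misspelled keys early instead of silently ignoring them.
        known = {f.name for f in fields(cls)}
        unknown = sorted(set(cfg) - known)
        if unknown:
            raise ValueError(f"unknown Slurm options: {unknown}")
        return cls(**cfg)

    def to_sbatch_args(self) -> List[str]:
        # Always emit the required flags, then the optional ones that are set.
        args = [
            f"--partition={self.partition}",
            f"--cpus-per-task={self.cpus_per_task}",
        ]
        if self.account is not None:
            args.append(f"--account={self.account}")
        if self.qos is not None:
            args.append(f"--qos={self.qos}")
        return args
```

With this shape, a per-cluster config only needs to list the values that differ from the defaults, e.g. `SlurmOptions.from_config({"partition": "gpu", "qos": "high"})`.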

Test Plan

Log in to the AWS login node and clone torchforge:

git clone git@github.com:meta-pytorch/torchforge.git

Then set up your conda environment:

conda create -n forge python=3.12
conda activate forge
./scripts/install.sh

Run GRPO on the Qwen3 8B model

./apps/grpo/slurm/submit.sh qwen3_8b

FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2362697/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/yo82qdaq

Run GRPO on the Qwen3 32B model

./apps/grpo/slurm/submit.sh qwen3_32b

FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2371720/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/ryy2u75n

Run GRPO on the Qwen3 30B A3B model

./apps/grpo/slurm/submit.sh qwen3_30b_a3b

FAIR cluster job: https://www.internalfb.com/fair_hub/job/FAIR_SC/2372099/details
W&B: https://fairwandb.org/agentic-models/grpo-training/runs/qmuo8whb
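
To repeat the whole test plan in one go, the three submissions above can be driven from a small wrapper. This is a sketch only: it assumes it is run from the repository root and that `submit.sh` takes the config name as its single argument, as in the commands above. The function names and the `dry_run` switch are illustrative and not part of this PR.

```python
import subprocess
from typing import List, Sequence

# Script path and config names are taken from the test plan above.
SUBMIT_SCRIPT = "./apps/grpo/slurm/submit.sh"
CONFIGS = ("qwen3_8b", "qwen3_32b", "qwen3_30b_a3b")


def build_commands(
    configs: Sequence[str] = CONFIGS, script: str = SUBMIT_SCRIPT
) -> List[List[str]]:
    """Return one argv list per config, without running anything."""
    return [[script, name] for name in configs]


def submit_all(
    configs: Sequence[str] = CONFIGS,
    script: str = SUBMIT_SCRIPT,
    dry_run: bool = True,
) -> List[List[str]]:
    """Submit each config in order; with dry_run=True only print the commands."""
    commands = build_commands(configs, script)
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops at the first failed submission.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `submit_all()` prints the three commands; `submit_all(dry_run=False)` actually invokes the script and should only be used on the login node.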

@meta-cla meta-cla bot added the CLA Signed label Nov 20, 2025
@daniellepintz daniellepintz changed the title from "Dp/aws fair" to "Add configs and script to launch GRPO jobs to AWS cluster via Slurm" Dec 11, 2025
@codecov-commenter
Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.69%. Comparing base (7b8580a) to head (e0ba693).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #598      +/-   ##
==========================================
- Coverage   83.05%   82.69%   -0.37%     
==========================================
  Files          31       31              
  Lines        4036     3946      -90     
==========================================
- Hits         3352     3263      -89     
+ Misses        684      683       -1     
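
As a sanity check on the table, the percentages can be recomputed from the hit and line counts. The sketch below assumes coverage is simply hits divided by lines; Codecov may round or weight partial lines differently, so the delta computed this way can differ from the reported one in the last digit.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Line coverage as a percentage, assuming coverage = hits / lines."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff above.
base_hits, base_misses, base_lines = 3352, 684, 4036
head_hits, head_misses, head_lines = 3263, 683, 3946

# Hits and misses should add up to the total line count on each side.
assert base_hits + base_misses == base_lines
assert head_hits + head_misses == head_lines

base = coverage_percent(base_hits, base_lines)
head = coverage_percent(head_hits, head_lines)
print(f"base {base:.2f}%  head {head:.2f}%  delta {head - base:+.2f}")
```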

@daniellepintz daniellepintz marked this pull request as ready for review December 12, 2025 13:50

@allenwang28 allenwang28 left a comment

lgtm, shall we remove the existing apps/grpo/qwen3_32b.yaml?

@daniellepintz
Contributor Author

yes, thanks, will do in a separate PR!

@daniellepintz daniellepintz merged commit 445bf59 into main Dec 12, 2025
8 checks passed
@daniellepintz daniellepintz deleted the dp/aws_fair branch December 12, 2025 15:28