Open
92 commits
fdeb482
x
erictang000 Apr 29, 2026
2f40ffe
x
erictang000 Apr 29, 2026
496bfb5
[wip] starting point for overnight nemotron3 nano debug
erictang000 Apr 30, 2026
86fe57b
[nemotron3] fix wake_up(kv_cache) OOM for 30B nano test
erictang000 Apr 30, 2026
d3d13ec
[debug] dump bridge-emitted weight names via SKYRL_DUMP_WEIGHT_NAMES
erictang000 Apr 30, 2026
d52a1e7
[debug] dump bucket-ordered broadcast names via SKYRL_DUMP_BROADCAST_…
erictang000 Apr 30, 2026
08c5d4b
[debug] env var to bypass bucketing for nemotron NaN diagnosis
erictang000 Apr 30, 2026
01c4a1d
[docs] running notes on the nemotron3 nano post-sync NaN
erictang000 Apr 30, 2026
a406c02
[debug] nemotron3-nano_tp2_ep2 variant for EP-localization
erictang000 Apr 30, 2026
7e49668
[debug] include value stats in broadcast dump
erictang000 Apr 30, 2026
7dcc5a2
[test] revert diagnostic-only nemotron3-nano_tp2_ep2 variant
erictang000 Apr 30, 2026
4b7e946
[docs] expand nemotron3 nano debug writeup with full findings
erictang000 Apr 30, 2026
1ca719c
[deps] bump vllm 0.19.0 -> 0.20.0, torch 2.10 -> 2.11
erictang000 Apr 30, 2026
7ee0593
[deps] regenerate uv.lock for vllm 0.20 / torch 2.11 upgrade
erictang000 Apr 30, 2026
f4af91d
[deps] use vllm 0.20.0+cu129 wheel; keep torch on cu128
erictang000 Apr 30, 2026
c867a68
[nemotron3][vllm020] force moe_backend=triton for nano test
erictang000 Apr 30, 2026
1e08a0d
[docs] capture vllm 0.20 upgrade results
erictang000 Apr 30, 2026
495cd4a
[nemotron3][vllm020] also set moe_backend=triton for the tiny model
erictang000 Apr 30, 2026
1d79e23
[docs] tiny test passes end-to-end on vllm 0.20
erictang000 Apr 30, 2026
1470e13
Merge branch 'main' of https://github.com/erictang000/SkyRL into nemo…
erictang000 Apr 30, 2026
4a72c42
[docs] capture run17/run18 results on merged stack
erictang000 Apr 30, 2026
96a48a6
x
erictang000 Apr 30, 2026
6a38b86
[nemotron3][vllm020] fix Mamba conv1d corruption + clean up debug ins…
erictang000 Apr 30, 2026
1318ff1
[overnight] start nemotron3_nano gsm8k + dapo runs
erictang000 May 1, 2026
218c625
[overnight] add moe_backend=triton + max_model_len overrides for vllm…
erictang000 May 1, 2026
808e035
[overnight] use inline-dict syntax for engine_init_kwargs override
erictang000 May 1, 2026
9842a52
[overnight] run03 reward=0 at step 1, monitoring
erictang000 May 1, 2026
8d5a5b0
[overnight] disable thinking mode for gsm8k (was burning all 1024 tok…
erictang000 May 1, 2026
4cb4ecc
[overnight] thinking back on + tight sampling + smaller batch
erictang000 May 1, 2026
4c615b0
[overnight] default gsm8k scoring to 'flexible' (extracts last number)
erictang000 May 1, 2026
d00eda4
[overnight] log run06 disk-full + uv cache move to /mnt/nvme
erictang000 May 1, 2026
76b4977
[overnight] move all uv cache subdirs to /mnt/nvme (run07 hit EXDEV)
erictang000 May 1, 2026
858d61e
[overnight] symlink ~/.cache/uv root to nvme; subdir symlinks aren't …
erictang000 May 1, 2026
840c360
[overnight] document run09 degenerate output + start standalone vllm …
erictang000 May 1, 2026
d35c58d
[overnight] try legacy inference path (_SKYRL_USE_NEW_INFERENCE=0)
erictang000 May 1, 2026
262dcb5
[overnight] async_engine=false to dodge OpenAIServingRender API misma…
erictang000 May 1, 2026
697b5b5
[overnight] step 1 reward = 0.940! legacy sync path works
erictang000 May 1, 2026
3380f3b
[overnight] step 2 reward 0.952 (+0.012). reward rising
erictang000 May 1, 2026
dfff3b7
[overnight] dapo: same _SKYRL_USE_NEW_INFERENCE=0 fix as gsm8k
erictang000 May 1, 2026
76a77f4
[overnight] step 4: 0.952. trajectory oscillating around ceiling
erictang000 May 1, 2026
5cb2815
[overnight] step 5+6 + eval: validation 0.953
erictang000 May 1, 2026
ddf69d0
[overnight] step 11 + eval@10: validation plateaued at 0.95. plan DAP…
erictang000 May 1, 2026
8f7cad7
[overnight] gsm8k 16 steps + 3 evals. validation flat at 0.952. cutti…
erictang000 May 1, 2026
432ecb1
[overnight] DAPO launch: bump eval_interval 5->10 to limit eval overhead
erictang000 May 1, 2026
0b49c58
[overnight] DAPO baseline AIME pass@32 = 0.50 (15/30 problems solved)
erictang000 May 1, 2026
ef0281a
[overnight] DAPO run02: shrink micro batches + expandable_segments af…
erictang000 May 1, 2026
7d3a90e
[overnight] DAPO run03: drop expandable_segments (vLLM incompatible),…
erictang000 May 1, 2026
0fdd0af
[overnight] DAPO run03 step 1 OK: pass@16=0.375, no OOM, 25min/step
erictang000 May 1, 2026
b75b272
[overnight] DAPO trajectory through step 4: pass@16 0.375 -> 0.391 (r…
erictang000 May 1, 2026
29c3483
[overnight] DAPO step 6 new peak: pass@16=0.445 (+0.070 vs step 1)
erictang000 May 1, 2026
2fec2f9
[overnight] DAPO 8 steps: peak pass@16=0.445 at step 6, mean ~0.378
erictang000 May 1, 2026
e00407f
[overnight] DAPO step 10 = 0.422 (new peak). final summary + TL;DR
erictang000 May 1, 2026
4bbd4c1
[overnight] DAPO eval@10: pass@32 0.30 -> 0.333 (+3.3pp), mean_pos +44%
erictang000 May 1, 2026
282c268
[overnight] DAPO step 11 = 0.484 pass@16, +11pp vs step 1
erictang000 May 1, 2026
b982ed6
[overnight] DAPO step 12 = 0.539 pass@16 (+16.4pp). still climbing
erictang000 May 1, 2026
903353a
[overnight] DAPO step 13-14: 0.453, 0.484. settling around 0.48 band
erictang000 May 1, 2026
b7b5184
[overnight] DAPO step 15 = 0.523. mean of last 5 = 0.501 vs first 5 =…
erictang000 May 1, 2026
d5c4545
[overnight] DAPO step 16-17: 0.531, 0.539. mean of last 7 = 0.508 (+1…
erictang000 May 1, 2026
c4962c6
[overnight] DAPO step 18 = 0.672 pass@16 (+29.7pp). massive jump
erictang000 May 1, 2026
cc01899
[overnight] DAPO step 20 = 0.719 pass@16 (+34.4pp vs step 1). eval@20…
erictang000 May 1, 2026
81e2fa5
[overnight] DAPO eval@20: AIME pass@32 = 0.500 (+20pp absolute, +67% …
erictang000 May 1, 2026
898e94d
[overnight] DAPO step 22 = 0.727 pass@16 (+35.2pp). steady gains cont…
erictang000 May 1, 2026
33d9873
[overnight] DAPO step 23-25: pass@16 peak now 0.742 (+36.7pp). still …
erictang000 May 1, 2026
831c3ca
[overnight] DAPO step 29 = 0.797 pass@16 (+42.2pp). still climbing
erictang000 May 1, 2026
c71d173
[overnight] DAPO eval@30: AIME pass@32 = 0.567 (17/30, +26.7pp). exce…
erictang000 May 1, 2026
30bc58b
[overnight] DAPO step 31-34: pass@16 peak now 0.844 (+46.9pp). platea…
erictang000 May 1, 2026
4aca79a
[overnight] DAPO eval@40 regression: 0.567 -> 0.433 (overfit signal)
erictang000 May 2, 2026
43dbd79
[overnight 8k+offload] init branch: 8k MAX_RESPONSE + optimizer cpu o…
erictang000 May 2, 2026
17b03af
[overnight 8k+offload] note nccl.h fix; run01 died at build, run02 la…
erictang000 May 2, 2026
869aee9
[overnight 8k+offload] run02 healthy: build done, eval@step0 in progr…
erictang000 May 2, 2026
1b71eb7
[overnight 8k+offload] eval@0: AIME pass@32 = 0.533 (16/30). step 1 g…
erictang000 May 2, 2026
8359098
[overnight 8k+offload] step 1 gen: pass@16 = 0.586 (+21pp vs 4k step …
erictang000 May 2, 2026
5d3514e
[overnight 8k+offload] step 1 complete: 48.3 min total. ~30 steps in …
erictang000 May 2, 2026
fa642f2
[overnight 8k+offload] step 2 gen: pass@16 = 0.656 (+7pp over step 1)
erictang000 May 2, 2026
db7cd6f
[overnight 8k+offload] step 2 done: 46.0 min (2.3 min faster than ste…
erictang000 May 2, 2026
3097525
[overnight 8k+offload] step 3 gen: pass@16 = 0.594 (-6pp from step 2,…
erictang000 May 2, 2026
e9cbe2d
[overnight 8k+offload] step 3 done: 47.0 min. mean 47.1 min/step over…
erictang000 May 2, 2026
ac46b03
[overnight 8k+offload] step 4 gen: pass@16 = 0.586. mean steps 1-4 = …
erictang000 May 2, 2026
721a601
[overnight 8k+offload] step 4 done: 46.1 min. mean 46.9 min/step over…
erictang000 May 2, 2026
8ec425b
[overnight 8k+offload] step 5 gen: pass@16 = 0.625 (+4pp over step 4)
erictang000 May 2, 2026
c807184
[overnight 8k+offload] step 5 done: 45.3 min (fastest yet). mean 46.5…
erictang000 May 2, 2026
f395414
[overnight 8k+offload] step 6 gen: pass@16 = 0.547 (oscillation; mean…
erictang000 May 3, 2026
087de91
[overnight 8k+offload] step 6 done: 45.6 min. mean 46.4 min/step over…
erictang000 May 3, 2026
1b3f4c4
[overnight 8k+offload] step 7 gen: pass@16 = 0.570. mean steps 1-7 = …
erictang000 May 3, 2026
08ea586
[overnight 8k+offload] step 7 done: 46.3 min. mean still 46.4 min/step
erictang000 May 3, 2026
95dddeb
[overnight 8k+offload] step 8 gen: pass@16 = 0.617 (+5pp over step 7)
erictang000 May 3, 2026
51dfb30
[overnight 8k+offload] step 8 done: 45.0 min (new fastest). mean 46.2…
erictang000 May 3, 2026
5569bb4
[overnight 8k+offload] step 9 gen: pass@16 = 0.648 (peak; +8pp jump).…
erictang000 May 3, 2026
094769b
[overnight 8k+offload] step 9 done: 44.1 min (new fastest). step 10 +…
erictang000 May 3, 2026
4c65ce4
[overnight 8k+offload] step 10 gen: pass@16 = 0.742 (+9pp). RL gradie…
erictang000 May 3, 2026
692764b
[overnight 8k+offload] step 10 done: 44.7 min. eval@10 running
erictang000 May 3, 2026
a2cc262
[overnight 8k+offload] eval@10: AIME pass@32 = 0.600 (18/30, +6.7pp o…
erictang000 May 3, 2026
511 changes: 511 additions & 0 deletions .claude/runs/PROGRESS.md

Large diffs are not rendered by default.

72 changes: 72 additions & 0 deletions .claude/settings.local.json
@@ -0,0 +1,72 @@
{
"permissions": {
"allow": [
"Bash(git add:*)",
"Bash(git commit:*)",
"Bash(git push:*)",
"Bash(git status:*)",
"Bash(git diff:*)",
"Bash(git log:*)",
"Bash(git branch:*)",
"Bash(git checkout:*)",
"Bash(git fetch:*)",
"Bash(git pull:*)",
"Bash(git remote:*)",
"Bash(git stash:*)",
"Bash(git rev-parse:*)",
"Bash(git config:*)",
"Bash(gh auth:*)",
"Bash(gh pr:*)",
"Bash(gh repo:*)",
"Bash(gh api:*)",
"Bash(tail:*)",
"Bash(head:*)",
"Bash(grep:*)",
"Bash(find:*)",
"Bash(awk:*)",
"Bash(sed:*)",
"Bash(cut:*)",
"Bash(sort:*)",
"Bash(wc:*)",
"Bash(ls:*)",
"Bash(cat:*)",
"Bash(stat:*)",
"Bash(file:*)",
"Bash(du:*)",
"Bash(df:*)",
"Bash(pwd:*)",
"Bash(echo:*)",
"Bash(printf:*)",
"Bash(date:*)",
"Bash(uptime:*)",
"Bash(free:*)",
"Bash(uname:*)",
"Bash(env:*)",
"Bash(ps:*)",
"Bash(pgrep:*)",
"Bash(pkill:*)",
"Bash(kill:*)",
"Bash(nvidia-smi:*)",
"Bash(jq:*)",
"Bash(zcat:*)",
"Bash(gunzip:*)",
"Bash(mkdir:*)",
"Bash(rmdir:*)",
"Bash(touch:*)",
"Bash(ln:*)",
"Bash(readlink:*)",
"Bash(realpath:*)",
"Bash(which:*)",
"Bash(test:*)",
"Bash(cp:*)",
"Bash(mv:*)",
"Bash(rm:*)",
"Bash(curl:*)",
"Bash(wget:*)",
"Bash(uv run:*)",
"Bash(uv pip:*)",
"Bash(bash examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh:*)",
"Bash(./examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh:*)"
]
}
}
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
70 changes: 70 additions & 0 deletions PROGRESS_8k_offload.md
@@ -0,0 +1,70 @@
# DAPO Nemotron3-Nano 8k+offload Overnight Run

Branch: `nemotron3_nano_8k_offload_overnight` (forked from `nemotron3_nano_overnight_runs` @ `4aca79ab`).

Purpose: continuation of the prior 4k overnight run. The 4k run reached step 40 with **AIME pass@32 trajectory 0.300 → 0.567 (peak @ step 30) → 0.433 (step 40)**, an overfit signal. This run flips two knobs (items 1 and 2), plus one dependent setting (item 3), to attack the truncation and overfit costs simultaneously:

1. `MAX_RESPONSE_LENGTH` 4096 → **8192**: AIME problems often need >4k tokens. The prior 4k baseline only solved 9/30 (vs 15/30 at 8k) before any RL — RL closed the gap (17/30 @ step 30) but truncation is a structural ceiling.
2. `OPTIMIZER_CPU_OFFLOAD=true` + `optimizer_offload_fraction=1.0`: makes 8k fit. Prior 8k attempt (`dapo_run01`) OOM'd at step 1 train. CPU-offloading the optimizer state (precision-aware AdamW with d2h/h2d overlap) frees GPU for activations.
3. `engine_init_kwargs.max_model_len`: 8192 → **12000** (matches new 2k prompt + 8k response + slack).
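
The context budget behind item 3 can be checked with a few lines of arithmetic. This is a minimal sketch: `context_slack` is a hypothetical helper written for illustration, not part of SkyRL or vLLM, and the three constants are the values quoted in this note.

```python
# Values quoted in this note; the helper name is illustrative only.
MAX_PROMPT_LENGTH = 1024 * 2    # 2k prompt
MAX_RESPONSE_LENGTH = 1024 * 8  # 8k response
MAX_MODEL_LEN = 12000           # engine_init_kwargs.max_model_len


def context_slack(prompt_len: int, response_len: int, max_model_len: int) -> int:
    """Return the unused tokens left in the context window.

    Raises ValueError when prompt + response cannot fit at all.
    """
    required = prompt_len + response_len
    if required > max_model_len:
        raise ValueError(
            f"prompt + response needs {required} tokens, "
            f"but max_model_len is only {max_model_len}"
        )
    return max_model_len - required


slack = context_slack(MAX_PROMPT_LENGTH, MAX_RESPONSE_LENGTH, MAX_MODEL_LEN)
print(f"required={MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH}, slack={slack}")
```

With the old limit of 8192, the same 2k prompt plus an 8k response would not fit, which is why the limit had to move together with `MAX_RESPONSE_LENGTH`.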

Hardware: 8x B200, 183 GB each. Megatron TP=4, PP=1, CP=1, EP=8, ETP=1.

Logs: `/mnt/nvme/etang/runs/dapo_8k_offload_run<NN>.log` (12T nvme — root only has 140G and uv cache eats it fast).

Wandb: project `dapo_nemotron3_nano`, run name `dapo_nemotron3_nano_30b_a3b_base_megatron_tp4_pp1_cp1_ep8_etp1_optim_offload_8k_max_response_length`.

## Per-step time budget

8k+offload step 1 was **48 min** (vs ~25 min at 4k). At this rate the 24h budget gets us:
- step 1 done: 20:42 UTC 5/2
- eval@10 expected ~04:42 UTC 5/3
- step 20 expected ~12:42 UTC 5/3
- eval@30 unlikely to fit (would land ~20:42 UTC 5/3 — past 24h budget)

If gen speeds up after step 1's vLLM compile cache warms (4k showed gen drop from 28→15 min after step 1), per-step could compress to ~35-40 min and eval@30 becomes reachable. Will track from step 2.
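
The projection above can be reproduced roughly as follows. This is a sketch under two assumptions that the notes do not state explicitly: the 24h budget is counted from the run02 launch at 19:26 UTC, and each milestone is counted as a whole number of steps after step 1 finished (so 10 steps at 48 min is about 8 hours, which also absorbs eval time). The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Step 1 completion time from the list above.
STEP1_DONE = datetime(2026, 5, 2, 20, 42, tzinfo=timezone.utc)
# Assumption: the 24h budget starts at the run02 launch (19:26 UTC on 5/2).
BUDGET_END = datetime(2026, 5, 2, 19, 26, tzinfo=timezone.utc) + timedelta(hours=24)


def eta(steps_ahead: int, minutes_per_step: float, start: datetime = STEP1_DONE) -> datetime:
    """Projected wall-clock time after `steps_ahead` more steps at a fixed rate."""
    return start + timedelta(minutes=steps_ahead * minutes_per_step)


def fits_budget(steps_ahead: int, minutes_per_step: float,
                deadline: datetime = BUDGET_END) -> bool:
    """True if the projected time lands on or before the budget deadline."""
    return eta(steps_ahead, minutes_per_step) <= deadline


for rate in (48, 40, 35):
    for label, n in (("eval@10", 10), ("step 20", 20), ("eval@30", 30)):
        when = eta(n, rate)
        status = "fits" if fits_budget(n, rate) else "past budget"
        print(f"{rate} min/step  {label}: {when:%m/%d %H:%M} UTC ({status})")
```

At 48 min/step the eval@30 milestone misses the deadline, while at 35-40 min/step it lands several hours inside it.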

## Hypotheses to test

- Does optimizer offload + 8k actually fit? (prior 4k run with no offload + micro_train=1 fit fine; 8k previously OOM'd on step 1.)
- Does an 8k cap eliminate the val regression seen at step 30→40 in 4k? (theory: model was learning to truncate aggressively, which started hurting AIME accuracy on long problems by step 40.)
- What's the per-step time? 4k was ~25 min/step; 8k will be slower from generation + activations, but optimizer offload eats some of that back.
- Eval baseline at 8k cap is 0.50 pass@32 (from `dapo_run01` step 0). Does this run beat 0.567 (the 4k-cap step-30 peak)?
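
For reading the pass@32 numbers in these hypotheses: the figures quoted here (for example 0.50 for 15/30 problems) are consistent with counting a problem as solved when at least one of its sampled responses is correct. The sketch below assumes that definition. Whether the trainer computes it exactly this way is not confirmed by these notes, and `pass_at_k` is an illustrative name.

```python
from typing import Sequence


def pass_at_k(correct_by_problem: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct sample.

    `correct_by_problem[i][j]` is True when sample j of problem i was graded
    correct. Every inner sequence is assumed to hold the same k samples.
    """
    if not correct_by_problem:
        raise ValueError("need at least one problem")
    solved = sum(1 for samples in correct_by_problem if any(samples))
    return solved / len(correct_by_problem)


# Synthetic example: 30 problems x 32 samples, 18 problems with one correct sample.
K = 32
solved_rows = [[True] + [False] * (K - 1) for _ in range(18)]
unsolved_rows = [[False] * K for _ in range(12)]
score = pass_at_k(solved_rows + unsolved_rows)
print(f"pass@{K} = {score:.3f}")
```

Under this definition, a single AIME problem flipping from unsolved to solved moves pass@32 by 1/30, about 3.3pp, so small differences between evals are within one or two problems.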

## Run log

### Spot-instance setup notes (one-time)

- nvme remounted fresh on this instance — moved `~/.cache/uv` → `/mnt/nvme/etang/uv-cache-real` (24G, was eating the 194G root); symlinked `~/exports` and `~/ckpts` to `/mnt/nvme/etang/{exports,ckpts}` so dumped_evals don't race against root fill.
- **transformer-engine-torch source build needed `nccl.h`.** No precompiled wheel exists for this torch+cuda combo (cu12.9, torch 2.11). The `--isolated` build env's `-I/usr/local/cuda/include` lacks nccl headers (cuda 12.9 install doesn't bundle them; nccl ships separately via `nccl-gib` package at `/usr/local/gib/`). Fix: `sudo ln -sf /usr/local/gib/include/nccl.h /usr/local/cuda/include/nccl.h` + corresponding libnccl.so symlinks. Done once — persists in /usr/local/cuda which survives the spot lifetime as long as cuda doesn't get upgraded.
- run01 died at this build step. run02 is the first real attempt.
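
A quick pre-flight check for the `nccl.h` problem could look like the sketch below. It only searches a list of include directories for the header. The two paths are the ones mentioned in the note above, and `find_header` is a hypothetical helper, not part of any build tool.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_header(name: str, include_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first existing `<dir>/<name>` among include_dirs, else None."""
    for directory in include_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


# Paths taken from the setup note; adjust for other machines.
CUDA_INCLUDE = "/usr/local/cuda/include"
GIB_INCLUDE = "/usr/local/gib/include"

in_cuda = find_header("nccl.h", [CUDA_INCLUDE])
if in_cuda is not None:
    print(f"nccl.h visible to the build at {in_cuda}")
else:
    elsewhere = find_header("nccl.h", [GIB_INCLUDE])
    if elsewhere is not None:
        print(f"nccl.h only found at {elsewhere}; symlink it into {CUDA_INCLUDE}")
    else:
        print("nccl.h not found in either directory")
```

Running this before `uv run --isolated` would surface the missing header in seconds instead of minutes into the transformer-engine-torch source build.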

### run01 (2026-05-02 19:21 UTC) — DIED at build (nccl.h missing)

See note above. Symlinked nccl into cuda dir, restarted as run02.

### run02 (2026-05-02 19:26 UTC) — running

- 19:26 launch → 19:30 build done (transformer-engine-torch + mamba-ssm)
- 19:35 ray actor groups initialized, mesh ranks set (TP=4 × DP=2)
- 19:37 init policy/ref/critic done. weight sync 9.7s
- 19:37:34 **eval@step0 started**
- Wandb: https://wandb.ai/sky-posttraining-uc-berkeley/dapo_nemotron3_nano/runs/7p8ir69t
- GPU mem 138-139 GB / 183 GB per device (~75% — fits with 8k headroom)
- Disk: root 102G/194G (62G HF cache for 30B BF16 model is the bulk; stable). nvme 37G/12T.

| step | pass@16 / pass@32 | raw_reward / avg_score | mean_pos_reward | gen (s) | train (s) | sync (s) | notes |
|------|-------------------|------------------------|-----------------|---------|-----------|----------|-------|
| 0 (eval) | pass@32 **0.533** (16/30) | avg_score -0.431 | 0.284 | — | — | 9.7 (init) | 8k cap, avg 7229 tokens, correct 4939. Beats 4k baseline 0.30 and run01's 0.50. Eval took 934s (15.6 min). |
| 1 (train batch) | pass@16 **0.586** | -0.743 | 0.372 | 1635 (27.3 min) | 1247 (20.8 min) | 9.4 | **Total step 1: 2900s = 48.3 min.** Train breakdown: fwd_logprobs 297s + compute_adv 0.3s + policy_train 950s. +21pp pass@16 vs 4k step 1; +57pp raw_reward thanks to less overlong penalty at 8k; mean_pos +6.7x. |
| 2 (train batch) | pass@16 **0.656** | -0.800 | 0.348 | 1675 (27.9 min) | 1066 (17.8 min) | 9.8 | **Total step 2: 2759s = 46.0 min** (-2.3 min vs step 1). fwd_logprobs 237s (-60s) + policy_train 829s (-121s, ~13% torch-compile warmup). +7pp pass@16. |
| 3 (train batch) | pass@16 **0.594** | -1.132 | 0.237 | 1718 (28.6 min) | 1079 (18.0 min) | 9.7 | **Total step 3: 2817s = 47.0 min.** policy_train 837s. -6pp pass@16 vs step 2 — noise band. |
| 4 (train batch) | pass@16 **0.586** | -0.951 | 0.292 | 1679 (28.0 min) | 1076 (17.9 min) | 9.7 | **Total step 4: 2765s = 46.1 min.** Mean steps 1-4 pass@16 = 0.606 (vs 0.371 mean of 4k steps 1-4 = +23.5pp). Mean step time 46.9 min. |
| 5 (train batch) | pass@16 **0.625** | -0.840 | 0.334 | 1650 (27.5 min) | 1056 (17.6 min) | 9.8 | **Total step 5: 2715s = 45.3 min — fastest yet.** policy_train 810s. +4pp over step 4. Mean step time over 1-5: 46.5 min. |
| 6 (train batch) | pass@16 **0.547** | -0.968 | 0.297 | 1655 (27.6 min) | 1062 (17.7 min) | 9.3 | **Total step 6: 2734s = 45.6 min.** Mean steps 1-6: 0.599 (vs 4k mean 1-6 = 0.387, +21pp). Mean step time 46.4 min. |
| 7 (train batch) | pass@16 **0.570** | -1.000 | 0.279 | 1690 (28.2 min) | 1078 (18.0 min) | 9.5 | **Total step 7: 2777s = 46.3 min.** Trend: 0.586, 0.656, 0.594, 0.586, 0.625, 0.547, 0.570 — pass@16 stuck around 0.59 mean. Need many more steps to see real RL gradient. |
| 8 (train batch) | pass@16 **0.617** | -0.815 | 0.342 | 1645 (27.4 min) | 1035 (17.2 min) | 9.3 | **Total step 8: 2698s = 45.0 min — new fastest.** policy_train 803s. +5pp over step 7. Mean steps 1-8: 0.598. Mean step time 46.2 min. |
| 9 (train batch) | pass@16 **0.648** | -0.701 | 0.386 | 1599 (26.7 min, fastest) | 1030 (17.2 min) | 9.3 | **Total step 9: 2646s = 44.1 min (new fastest).** policy_train 800s. Highest pass@16 since step 2 (0.656). Mean steps 1-9: 0.604. Mean step time 46.0 min. |
| 10 (train batch) | pass@16 **0.742** | -0.526 | 0.425 | 1642 (27.4 min) | 1024 (17.1 min) | 9.6 | **Total step 10: 2683s = 44.7 min.** policy_train 796s. **Big jump: +9pp over step 9, +16pp over step 1.** Mean steps 1-10: 0.617. Mean step time 45.8 min. |
| 10 (eval) | pass@32 **0.600** (18/30) | avg_score -0.298 | 0.351 | — | — | 820s eval | **+6.7pp over baseline 0.533.** avg tokens 6943 (vs 7229 baseline → -286), correct-answer 4710 (vs 4939 → -229) — slightly shorter responses, clear improvement. Already past 4k step 20 (0.500). |
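
The running means quoted in the notes column can be recomputed from the table. The sketch below copies the per-step pass@16 values and total step times from the rows above. Because the inputs are the rounded table values, the last digit can differ slightly from means computed on unrounded logs.

```python
# Per-step values copied from the table (steps 1-10).
PASS_AT_16 = [0.586, 0.656, 0.594, 0.586, 0.625, 0.547, 0.570, 0.617, 0.648, 0.742]
STEP_SECONDS = [2900, 2759, 2817, 2765, 2715, 2734, 2777, 2698, 2646, 2683]


def running_means(values):
    """Return the mean of values[:i] for i = 1..len(values)."""
    means = []
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means


pass_means = running_means(PASS_AT_16)
time_means_min = [seconds / 60 for seconds in running_means(STEP_SECONDS)]

for step, (p, mean_p, mean_t) in enumerate(
    zip(PASS_AT_16, pass_means, time_means_min), start=1
):
    print(f"step {step:2d}: pass@16={p:.3f}  running mean={mean_p:.3f}  "
          f"mean step time={mean_t:.1f} min")

best_step = max(range(len(PASS_AT_16)), key=PASS_AT_16.__getitem__) + 1
print(f"best pass@16 so far: step {best_step}")
```

Tracking the running mean alongside the per-step value makes it easier to separate the step-to-step noise band (roughly 0.55 to 0.66 over steps 1-9) from a sustained shift such as the step 10 jump.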

144 changes: 144 additions & 0 deletions examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh
@@ -0,0 +1,144 @@
set -x

# Use the legacy (non-chunked) inference path to avoid the vLLM 0.20
# layerwise-reload corruption that derails post-sync generation for nemotron_h.
# See PROGRESS.md / gsm8k_run09 → run11 for the diagnosis.
export _SKYRL_USE_NEW_INFERENCE=0
# NOTE: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is incompatible with
# vLLM's CuMemAllocator (assertion in vllm/device_allocator/cumem.py:132,
# pytorch/pytorch#147851). Rely on smaller micro batches + optimizer CPU
# offload (see OPTIMIZER_CPU_OFFLOAD below) instead.

# Colocated DAPO training+generation for Nemotron3-Nano-30B-A3B on DAPO with Megatron.
# Should run on 1 node of 8xB200

# bash examples/train/algorithms/dapo/prepare_dapo_data.sh
# bash examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh

MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
DATA_DIR="$HOME/data/dapo"
TRAIN_FILE="$DATA_DIR/dapo-math-17k-cleaned.parquet"
TEST_FILE="$DATA_DIR/aime-2024-cleaned.parquet"
NUM_NODES=1
NUM_GPUS_PER_NODE=8
NUM_INFERENCE_ENGINES=1
INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE=8
LOGGER="wandb" # change to "console" to print to stdout

CLIP_RATIO_LOW=0.2
CLIP_RATIO_HIGH=0.28
# use token mean loss reduction
LOSS_REDUCTION="token_mean"
# applies overlong filtering (but not soft overlong punishment)
APPLY_OVERLONG_FILTERING=true
# apply soft overlong punishment with custom trainer impl in main_dapo.py
OVERLONG_BUFFER_LEN=$((1024 * 2))
OVERLONG_BUFFER_PENALTY_FACTOR=1.0

# other DAPO parameters
USE_KL_LOSS=false
TEMPERATURE=1.0
TOP_P=1.0
EVAL_TOP_P=0.7
CLIP_RATIO_C=10.0
MAX_PROMPT_LENGTH=$((1024 * 2))
# Back at 8192 after the 4096 overnight smoke run. Full 8k responses
# previously pushed Megatron's packed activations OOM (run01); optimizer CPU
# offload (below) frees the headroom. AIME problems often need >4k tokens.
MAX_RESPONSE_LENGTH=$((1024 * 8))

# repro run parameters
TRAIN_BATCH_SIZE=128
MINI_BATCH_SIZE=32
N_SAMPLES_PER_PROMPT=16
EVAL_N_SAMPLES_PER_PROMPT=32
ENFORCE_EAGER=true # cuda graphs can cause some instability
LR=1e-6

# megatron config
MEGATRON_TP=4
MEGATRON_PP=1
MEGATRON_CP=1
MEGATRON_EP=8
MEGATRON_ETP=1


# TIS parameters
TIS_IMP_RATIO_CAP=2.0
TIS_TYPE=token

OPTIMIZER_OFFLOAD_FRACTION=1.0
OPTIMIZER_CPU_OFFLOAD=true

uv run --isolated --extra megatron -m examples.train.algorithms.dapo.main_dapo \
data.train_data="['$TRAIN_FILE']" \
data.val_data="['$TEST_FILE']" \
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.policy_loss_type="dual_clip" \
trainer.algorithm.overlong_buffer_len=$OVERLONG_BUFFER_LEN \
trainer.algorithm.overlong_buffer_penalty_factor=$OVERLONG_BUFFER_PENALTY_FACTOR \
trainer.algorithm.loss_reduction=$LOSS_REDUCTION \
generator.inference_engine.enforce_eager=$ENFORCE_EAGER \
generator.apply_overlong_filtering=$APPLY_OVERLONG_FILTERING \
generator.sampling_params.temperature=$TEMPERATURE \
generator.sampling_params.top_p=$TOP_P \
generator.eval_sampling_params.top_p=$EVAL_TOP_P \
generator.eval_sampling_params.temperature=$TEMPERATURE \
generator.eval_sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
trainer.algorithm.clip_ratio_c=$CLIP_RATIO_C \
trainer.policy.model.path="$MODEL_NAME" \
trainer.placement.colocate_all=true \
trainer.strategy=megatron \
trainer.placement.policy_num_nodes=$NUM_NODES \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
generator.inference_engine.num_engines=$NUM_INFERENCE_ENGINES \
generator.inference_engine.tensor_parallel_size=$INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE \
trainer.policy.megatron_config.tensor_model_parallel_size=$MEGATRON_TP \
trainer.policy.megatron_config.pipeline_model_parallel_size=$MEGATRON_PP \
trainer.policy.megatron_config.context_parallel_size=$MEGATRON_CP \
trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.policy.megatron_config.optimizer_config_kwargs.optimizer_offload_fraction=$OPTIMIZER_OFFLOAD_FRACTION \
trainer.policy.megatron_config.optimizer_config_kwargs.optimizer_cpu_offload=$OPTIMIZER_CPU_OFFLOAD \
trainer.policy.megatron_config.optimizer_config_kwargs.use_precision_aware_optimizer=$OPTIMIZER_CPU_OFFLOAD \
trainer.policy.megatron_config.optimizer_config_kwargs.overlap_cpu_optimizer_d2h_h2d=$OPTIMIZER_CPU_OFFLOAD \
trainer.algorithm.off_policy_correction.tis_ratio_type=$TIS_TYPE \
trainer.algorithm.off_policy_correction.token_tis_ratio_clip_high=$TIS_IMP_RATIO_CAP \
trainer.epochs=20 \
trainer.algorithm.eps_clip_low=$CLIP_RATIO_LOW \
trainer.algorithm.eps_clip_high=$CLIP_RATIO_HIGH \
trainer.eval_batch_size=1024 \
trainer.eval_before_train=true \
trainer.eval_interval=10 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=$TRAIN_BATCH_SIZE \
trainer.policy_mini_batch_size=$MINI_BATCH_SIZE \
trainer.micro_forward_batch_size_per_gpu=2 \
trainer.micro_train_batch_size_per_gpu=1 \
trainer.ckpt_interval=-1 \
trainer.max_prompt_length=$MAX_PROMPT_LENGTH \
generator.sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
trainer.policy.optimizer_config.lr=$LR \
trainer.policy.optimizer_config.num_warmup_steps=40 \
trainer.policy.optimizer_config.weight_decay=0.1 \
trainer.policy.optimizer_config.max_grad_norm=1.0 \
generator.inference_engine.backend=vllm \
generator.inference_engine.run_engines_locally=true \
generator.inference_engine.weight_sync_backend=nccl \
generator.inference_engine.async_engine=false \
generator.batched=true \
environment.env_class=aime \
generator.n_samples_per_prompt=$N_SAMPLES_PER_PROMPT \
generator.eval_n_samples_per_prompt=$EVAL_N_SAMPLES_PER_PROMPT \
generator.inference_engine.gpu_memory_utilization=0.6 \
generator.inference_engine.engine_init_kwargs="{moe_backend: triton, max_model_len: 12000}" \
trainer.logger="$LOGGER" \
trainer.project_name="dapo_nemotron3_nano" \
trainer.run_name="dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
trainer.export_path="$HOME/exports/dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
trainer.hf_save_interval=-1 \
trainer.resume_mode=latest \
trainer.max_ckpts_to_keep=3 \
trainer.ckpt_path="$HOME/ckpts/dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
"$@"
87 changes: 87 additions & 0 deletions examples/train/megatron/run_megatron_nemotron3_nano.sh
@@ -0,0 +1,87 @@
set -x

# Use the legacy (non-chunked) inference path. The new path goes through
# vLLM's layerwise reload, which re-runs `process_weights_after_loading` and
# (likely) re-creates view-buffer aliases that corrupt MoE/conv weights for
# nemotron_h beyond the `conv_weights` skip we already added. Standalone
# vLLM with HF weights at T=0.7 produces correct gsm8k answers; post-Megatron-
# sync vLLM produces degenerate output. Legacy path uses CUDA IPC + direct
# model.load_weights, no reload machinery.
export _SKYRL_USE_NEW_INFERENCE=0

# Colocated GRPO training+generation for Nemotron3-Nano-30B-A3B on GSM8K with Megatron.

# uv run examples/train/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/train/megatron/run_megatron_nemotron3_nano.sh

DATA_DIR="$HOME/data/gsm8k"
LOGGER="wandb" # change to "console" to print to stdout
MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

INFERENCE_BACKEND="vllm" # currently only vllm is supported for megatron

NUM_NODES=1
NUM_GPUS=8

MEGATRON_TP=4
MEGATRON_PP=1
MEGATRON_CP=1
MEGATRON_EP=8
MEGATRON_ETP=1

INFERENCE_ENGINE_TP=8

# # Qwen3.5 flags
# USE_SAMPLE_PACKING=false # sample packing is not yet supported for GDN layers in megatron - see: https://github.com/NVIDIA/Megatron-LM/pull/2644

uv run --isolated --extra megatron -m skyrl.train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.advantage_estimator="grpo" \
trainer.policy.model.path=$MODEL_NAME \
trainer.placement.colocate_all=true \
trainer.strategy=megatron \
trainer.placement.policy_num_nodes=$NUM_NODES \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
trainer.placement.critic_num_gpus_per_node=$NUM_GPUS \
trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
generator.inference_engine.num_engines=1 \
generator.inference_engine.tensor_parallel_size=$INFERENCE_ENGINE_TP \
trainer.policy.megatron_config.tensor_model_parallel_size=$MEGATRON_TP \
trainer.policy.megatron_config.pipeline_model_parallel_size=$MEGATRON_PP \
trainer.policy.megatron_config.context_parallel_size=$MEGATRON_CP \
trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.use_sample_packing=true \
trainer.epochs=20 \
trainer.eval_batch_size=256 \
trainer.eval_before_train=false \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=256 \
trainer.policy_mini_batch_size=64 \
trainer.micro_forward_batch_size_per_gpu=4 \
trainer.micro_train_batch_size_per_gpu=4 \
trainer.ckpt_interval=-1 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=3000 \
generator.sampling_params.temperature=0.7 \
generator.sampling_params.top_p=0.9 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=true \
generator.inference_engine.backend=$INFERENCE_BACKEND \
generator.inference_engine.run_engines_locally=true \
generator.inference_engine.weight_sync_backend=nccl \
generator.inference_engine.async_engine=false \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.inference_engine.gpu_memory_utilization=0.6 \
generator.inference_engine.engine_init_kwargs="{moe_backend: triton, max_model_len: 4096}" \
trainer.logger="$LOGGER" \
trainer.project_name="nemotron3_nano" \
trainer.run_name="nemotron3_nano_megatron" \
trainer.resume_mode=null \
trainer.ckpt_path="$HOME/ckpts/nemotron3_nano_megatron_ckpt" \
"$@"