erictang000 · Apr 29, 2026 · Apr 30, 2026
diff --git a/.claude/runs/PROGRESS.md b/.claude/runs/PROGRESS.md
diff --git a/.claude/settings.local.json b/.claude/settings.local.json
@@ -0,0 +1,72 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(git add:*)",
+      "Bash(git commit:*)",
+      "Bash(git push:*)",
+      "Bash(git status:*)",
+      "Bash(git diff:*)",
+      "Bash(git log:*)",
+      "Bash(git branch:*)",
+      "Bash(git checkout:*)",
+      "Bash(git fetch:*)",
+      "Bash(git pull:*)",
+      "Bash(git remote:*)",
+      "Bash(git stash:*)",
+      "Bash(git rev-parse:*)",
+      "Bash(git config:*)",
+      "Bash(gh auth:*)",
+      "Bash(gh pr:*)",
+      "Bash(gh repo:*)",
+      "Bash(gh api:*)",
+      "Bash(tail:*)",
+      "Bash(head:*)",
+      "Bash(grep:*)",
+      "Bash(find:*)",
+      "Bash(awk:*)",
+      "Bash(sed:*)",
+      "Bash(cut:*)",
+      "Bash(sort:*)",
+      "Bash(wc:*)",
+      "Bash(ls:*)",
+      "Bash(cat:*)",
+      "Bash(stat:*)",
+      "Bash(file:*)",
+      "Bash(du:*)",
+      "Bash(df:*)",
+      "Bash(pwd:*)",
+      "Bash(echo:*)",
+      "Bash(printf:*)",
+      "Bash(date:*)",
+      "Bash(uptime:*)",
+      "Bash(free:*)",
+      "Bash(uname:*)",
+      "Bash(env:*)",
+      "Bash(ps:*)",
+      "Bash(pgrep:*)",
+      "Bash(pkill:*)",
+      "Bash(kill:*)",
+      "Bash(nvidia-smi:*)",
+      "Bash(jq:*)",
+      "Bash(zcat:*)",
+      "Bash(gunzip:*)",
+      "Bash(mkdir:*)",
+      "Bash(rmdir:*)",
+      "Bash(touch:*)",
+      "Bash(ln:*)",
+      "Bash(readlink:*)",
+      "Bash(realpath:*)",
+      "Bash(which:*)",
+      "Bash(test:*)",
+      "Bash(cp:*)",
+      "Bash(mv:*)",
+      "Bash(rm:*)",
+      "Bash(curl:*)",
+      "Bash(wget:*)",
+      "Bash(uv run:*)",
+      "Bash(uv pip:*)",
+      "Bash(bash examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh:*)",
+      "Bash(./examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh:*)"
+    ]
+  }
+}
diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.12
diff --git a/PROGRESS_8k_offload.md b/PROGRESS_8k_offload.md
@@ -0,0 +1,70 @@
+# DAPO Nemotron3-Nano 8k+offload Overnight Run
+
+Branch: `nemotron3_nano_8k_offload_overnight` (forked from `nemotron3_nano_overnight_runs` @ `4aca79ab`).
+
+Purpose: continuation of the prior 4k overnight run. The 4k run reached step 40 with an **AIME pass@32 trajectory of 0.300 → 0.567 (peak @ step 30) → 0.433 (step 40)**, which is an overfit signal. This run flips two knobs to attack truncation and overfitting together (items 1 and 2), plus one supporting engine setting (item 3):
+
+1. `MAX_RESPONSE_LENGTH` 4096 → **8192**: AIME problems often need >4k tokens. The prior 4k baseline only solved 9/30 (vs 15/30 at 8k) before any RL — RL closed the gap (17/30 @ step 30) but truncation is a structural ceiling.
+2. `OPTIMIZER_CPU_OFFLOAD=true` + `optimizer_offload_fraction=1.0`: makes 8k fit. Prior 8k attempt (`dapo_run01`) OOM'd at step 1 train. CPU-offloading the optimizer state (precision-aware AdamW with d2h/h2d overlap) frees GPU for activations.
+3. `engine_init_kwargs.max_model_len`: 8192 → **12000** (matches new 2k prompt + 8k response + slack).
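The `max_model_len` sizing in item 3 comes down to token arithmetic. A minimal Python sketch, using the 2k prompt and 8k response limits stated in the notes (the variable names here are illustrative and are not read from the training config):

```python
# Token budget behind max_model_len (illustrative names; values from the notes).
MAX_PROMPT_LENGTH = 2 * 1024      # 2k prompt
MAX_RESPONSE_LENGTH = 8 * 1024    # 8k response
MAX_MODEL_LEN = 12000             # value passed through engine_init_kwargs

required = MAX_PROMPT_LENGTH + MAX_RESPONSE_LENGTH
slack = MAX_MODEL_LEN - required
assert slack >= 0, "max_model_len must cover prompt + response"
print(f"required={required} tokens, slack={slack} tokens")
```

With these values the prompt plus response needs 10240 tokens, leaving 1760 tokens of slack under the 12000 limit.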
+
+Hardware: 8x B200, 183 GB each. Megatron TP=4, PP=1, CP=1, EP=8, ETP=1.
+
+Logs: `/mnt/nvme/etang/runs/dapo_8k_offload_run<NN>.log` (12T nvme — root only has 140G and uv cache eats it fast).
+
+Wandb: project `dapo_nemotron3_nano`, run name `dapo_nemotron3_nano_30b_a3b_base_megatron_tp4_pp1_cp1_ep8_etp1_optim_offload_8k_max_response_length`.
+
+## Per-step time budget
+
+8k+offload step 1 was **48 min** (vs ~25 min at 4k). At this rate the 24h budget gets us:
+- step 1 done: 20:42 UTC 5/2
+- eval@10 expected ~04:42 UTC 5/3
+- step 20 expected ~12:42 UTC 5/3
+- eval@30 unlikely to fit (would land ~20:42 UTC 5/3 — past 24h budget)
+
+If gen speeds up after step 1's vLLM compile cache warms (4k showed gen drop from 28→15 min after step 1), per-step could compress to ~35-40 min and eval@30 becomes reachable. Will track from step 2.
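The budget reasoning above can be checked with a short calculation. This is a rough sketch that assumes a constant per-step time and ignores eval passes and startup overhead; `steps_within_budget` is a hypothetical helper, not part of the repo:

```python
def steps_within_budget(minutes_per_step: float, budget_hours: float = 24.0) -> int:
    """Number of whole training steps that fit in the budget (evals ignored)."""
    return int((budget_hours * 60) // minutes_per_step)


for minutes in (48, 40, 35):
    print(f"{minutes} min/step -> {steps_within_budget(minutes)} steps in 24h")
```

At 48 min/step exactly 30 steps fit with no room left for eval passes, which is consistent with eval@30 being out of reach. At 35 to 40 min/step the same budget covers 36 to 41 steps, so eval@30 becomes plausible.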
+
+## Hypotheses to test
+
+- Does optimizer offload + 8k actually fit? (prior 4k run with no offload + micro_train=1 fit fine; 8k previously OOM'd on step 1.)
+- Does an 8k cap eliminate the val regression seen at step 30→40 in 4k? (theory: model was learning to truncate aggressively, which started hurting AIME accuracy on long problems by step 40.)
+- What's the per-step time? 4k was ~25 min/step; 8k will be slower from generation + activations, but optimizer offload eats some of that back.
+- Eval baseline at 8k cap is 0.50 pass@32 (from `dapo_run01` step 0). Does this run beat 0.567 (the 4k-cap step-30 peak)?
+
+## Run log
+
+### Spot-instance setup notes (one-time)
+
+- nvme remounted fresh on this instance — moved `~/.cache/uv` → `/mnt/nvme/etang/uv-cache-real` (24G, was eating the 194G root); symlinked `~/exports` and `~/ckpts` to `/mnt/nvme/etang/{exports,ckpts}` so dumped_evals don't race against root fill.
+- **transformer-engine-torch source build needed `nccl.h`.** No precompiled wheel exists for this torch+cuda combo (cu12.9, torch 2.11). The `--isolated` build env's `-I/usr/local/cuda/include` lacks nccl headers (cuda 12.9 install doesn't bundle them; nccl ships separately via `nccl-gib` package at `/usr/local/gib/`). Fix: `sudo ln -sf /usr/local/gib/include/nccl.h /usr/local/cuda/include/nccl.h` + corresponding libnccl.so symlinks. Done once — persists in /usr/local/cuda which survives the spot lifetime as long as cuda doesn't get upgraded.
+- run01 died at this build step. run02 is the first real attempt.
+
+### run01 (2026-05-02 19:21 UTC) — DIED at build (nccl.h missing)
+
+See note above. Symlinked nccl into cuda dir, restarted as run02.
+
+### run02 (2026-05-02 19:26 UTC) — running
+
+- 19:26 launch → 19:30 build done (transformer-engine-torch + mamba-ssm)
+- 19:35 ray actor groups initialized, mesh ranks set (TP=4 × DP=2)
+- 19:37 init policy/ref/critic done. weight sync 9.7s
+- 19:37:34 **eval@step0 started**
+- Wandb: https://wandb.ai/sky-posttraining-uc-berkeley/dapo_nemotron3_nano/runs/7p8ir69t
+- GPU mem 138-139 GB / 183 GB per device (~75% — fits with 8k headroom)
+- Disk: root 102G/194G (62G HF cache for 30B BF16 model is the bulk; stable). nvme 37G/12T.
+
+| step | pass@16 / pass@32 | raw_reward / avg_score | mean_pos_reward | gen (s) | train (s) | sync (s) | notes |
+|------|-------------------|------------------------|-----------------|---------|-----------|----------|-------|
+| 0 (eval) | pass@32 **0.533** (16/30) | avg_score -0.431 | 0.284 | — | — | 9.7 (init) | 8k cap, avg 7229 tokens, correct 4939. Beats 4k baseline 0.30 and run01's 0.50. Eval took 934s (15.6 min). |
+| 1 (train batch) | pass@16 **0.586** | -0.743 | 0.372 | 1635 (27.3 min) | 1247 (20.8 min) | 9.4 | **Total step 1: 2900s = 48.3 min.** Train breakdown: fwd_logprobs 297s + compute_adv 0.3s + policy_train 950s. +21pp pass@16 vs 4k step 1; +57pp raw_reward thanks to less overlong penalty at 8k; mean_pos +6.7x. |
+| 2 (train batch) | pass@16 **0.656** | -0.800 | 0.348 | 1675 (27.9 min) | 1066 (17.8 min) | 9.8 | **Total step 2: 2759s = 46.0 min** (-2.3 min vs step 1). fwd_logprobs 237s (-60s) + policy_train 829s (-121s, ~13% torch-compile warmup). +7pp pass@16. |
+| 3 (train batch) | pass@16 **0.594** | -1.132 | 0.237 | 1718 (28.6 min) | 1079 (18.0 min) | 9.7 | **Total step 3: 2817s = 47.0 min.** policy_train 837s. -6pp pass@16 vs step 2 — noise band. |
+| 4 (train batch) | pass@16 **0.586** | -0.951 | 0.292 | 1679 (28.0 min) | 1076 (17.9 min) | 9.7 | **Total step 4: 2765s = 46.1 min.** Mean steps 1-4 pass@16 = 0.606 (vs 0.371 mean of 4k steps 1-4 = +23.5pp). Mean step time 46.9 min. |
+| 5 (train batch) | pass@16 **0.625** | -0.840 | 0.334 | 1650 (27.5 min) | 1056 (17.6 min) | 9.8 | **Total step 5: 2715s = 45.3 min — fastest yet.** policy_train 810s. +4pp over step 4. Mean step time over 1-5: 46.5 min. |
+| 6 (train batch) | pass@16 **0.547** | -0.968 | 0.297 | 1655 (27.6 min) | 1062 (17.7 min) | 9.3 | **Total step 6: 2734s = 45.6 min.** Mean steps 1-6: 0.599 (vs 4k mean 1-6 = 0.387, +21pp). Mean step time 46.4 min. |
+| 7 (train batch) | pass@16 **0.570** | -1.000 | 0.279 | 1690 (28.2 min) | 1078 (18.0 min) | 9.5 | **Total step 7: 2777s = 46.3 min.** Trend: 0.586, 0.656, 0.594, 0.586, 0.625, 0.547, 0.570 — pass@16 stuck around 0.59 mean. Need many more steps to see real RL gradient. |
+| 8 (train batch) | pass@16 **0.617** | -0.815 | 0.342 | 1645 (27.4 min) | 1035 (17.2 min) | 9.3 | **Total step 8: 2698s = 45.0 min — new fastest.** policy_train 803s. +5pp over step 7. Mean steps 1-8: 0.598. Mean step time 46.2 min. |
+| 9 (train batch) | pass@16 **0.648** | -0.701 | 0.386 | 1599 (26.7 min, fastest) | 1030 (17.2 min) | 9.3 | **Total step 9: 2646s = 44.1 min — new fastest.** policy_train 800s. New peak pass@16. Mean steps 1-9: 0.604. Mean step time 46.0 min. |
+| 10 (train batch) | pass@16 **0.742** | -0.526 | 0.425 | 1642 (27.4 min) | 1024 (17.1 min) | 9.6 | **Total step 10: 2683s = 44.7 min.** policy_train 796s. **Big jump: +9pp over step 9, +16pp over step 1.** Mean steps 1-10: 0.617. Mean step time 45.8 min. |
+| 10 (eval) | pass@32 **0.600** (18/30) | avg_score -0.298 | 0.351 | — | — | 820s eval | **+6.7pp over baseline 0.533.** avg tokens 6943 (vs 7229 baseline → -286), correct-answer 4710 (vs 4939 → -229) — slightly shorter responses, clear improvement. Already past 4k step 20 (0.500). |
+
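The running means quoted in the notes column of the run02 table can be reproduced from the table itself. A small sketch, with the pass@16 and total-step-time values copied from rows 1 to 10 (the list layout is just for illustration):

```python
from statistics import fmean

# (pass@16, total step seconds) for train steps 1-10, copied from the run02 table.
steps = [
    (0.586, 2900), (0.656, 2759), (0.594, 2817), (0.586, 2765), (0.625, 2715),
    (0.547, 2734), (0.570, 2777), (0.617, 2698), (0.648, 2646), (0.742, 2683),
]

mean_pass16 = fmean(p for p, _ in steps)
mean_step_min = fmean(s for _, s in steps) / 60
print(f"mean pass@16 = {mean_pass16:.3f}, mean step time = {mean_step_min:.1f} min")
```

This agrees with the step 10 row (mean pass@16 0.617, mean step time 45.8 min). Because the inputs are the rounded table values, intermediate means can differ from the logged ones in the last digit.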
diff --git a/examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh b/examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh
@@ -0,0 +1,144 @@
+set -x
+
+# Use the legacy (non-chunked) inference path to avoid the vLLM 0.20
+# layerwise-reload corruption that derails post-sync generation for nemotron_h.
+# See .claude/runs/PROGRESS.md (gsm8k_run09 → run11) for the diagnosis.
+export _SKYRL_USE_NEW_INFERENCE=0
+# NOTE: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is incompatible with
+# vLLM's CuMemAllocator (assertion in vllm/device_allocator/cumem.py:132,
+# pytorch/pytorch#147851). Rely on smaller micro batches + shorter
+# MAX_RESPONSE_LENGTH instead.
+
+# Colocated DAPO training+generation for Nemotron3-Nano-30B-A3B on DAPO-Math-17k with Megatron.
+# Should run on 1 node of 8xB200
+
+# bash examples/train/algorithms/dapo/prepare_dapo_data.sh
+# bash examples/train/megatron/run_megatron_dapo_nemotron3_nano.sh
+
+MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
+DATA_DIR="$HOME/data/dapo"
+TRAIN_FILE="$DATA_DIR/dapo-math-17k-cleaned.parquet"
+TEST_FILE="$DATA_DIR/aime-2024-cleaned.parquet"
+NUM_NODES=1
+NUM_GPUS_PER_NODE=8
+NUM_INFERENCE_ENGINES=1
+INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE=8
+LOGGER="wandb"  # change to "console" to print to stdout
+
+CLIP_RATIO_LOW=0.2
+CLIP_RATIO_HIGH=0.28
+# use token mean loss reduction
+LOSS_REDUCTION="token_mean"
+# applies overlong filtering (but not soft overlong punishment)
+APPLY_OVERLONG_FILTERING=true
+# apply soft overlong punishment with custom trainer impl in main_dapo.py
+OVERLONG_BUFFER_LEN=$((1024 * 2))
+OVERLONG_BUFFER_PENALTY_FACTOR=1.0
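For readers unfamiliar with the soft overlong punishment these two variables configure, here is a minimal Python sketch. It assumes the linear, DAPO-style form of the penalty; the actual implementation lives in `main_dapo.py`, which is not part of this diff, so the function name and exact formula are assumptions:

```python
def soft_overlong_penalty(
    response_len: int,
    max_response_len: int = 8192,   # MAX_RESPONSE_LENGTH
    buffer_len: int = 2048,         # OVERLONG_BUFFER_LEN
    penalty_factor: float = 1.0,    # OVERLONG_BUFFER_PENALTY_FACTOR
) -> float:
    """Linear length penalty in [-penalty_factor, 0] (assumed DAPO-style form)."""
    expected_len = max_response_len - buffer_len   # penalty-free up to this length
    exceed_len = response_len - expected_len
    # No penalty below the threshold; ramps linearly to -penalty_factor at the cap.
    return min(-exceed_len / buffer_len * penalty_factor, 0.0)


for n in (1000, 6144, 7168, 8192):
    print(n, soft_overlong_penalty(n))
```

Under this assumption, responses up to 6144 tokens are penalty-free at the 8k cap, and the penalty reaches -1.0 at 8192 tokens.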
+
+# other DAPO parameters
+USE_KL_LOSS=false
+TEMPERATURE=1.0
+TOP_P=1.0
+EVAL_TOP_P=0.7
+CLIP_RATIO_C=10.0
+MAX_PROMPT_LENGTH=$((1024 * 2))
+# Back at 8192 after the 4096 overnight run. Without optimizer CPU offload,
+# full 8k responses pushed Megatron's packed activations OOM (run01); with
+# OPTIMIZER_CPU_OFFLOAD=true (below) they fit. AIME problems often need >4k.
+MAX_RESPONSE_LENGTH=$((1024 * 8))
+
+# repro run parameters
+TRAIN_BATCH_SIZE=128
+MINI_BATCH_SIZE=32
+N_SAMPLES_PER_PROMPT=16
+EVAL_N_SAMPLES_PER_PROMPT=32
+ENFORCE_EAGER=true # cuda graphs can cause some instability
+LR=1e-6
+
+# megatron config
+MEGATRON_TP=4
+MEGATRON_PP=1
+MEGATRON_CP=1
+MEGATRON_EP=8
+MEGATRON_ETP=1
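The data-parallel sizes implied by this layout on 8 GPUs can be derived as follows. This sketch assumes the usual Megatron-Core convention (dense DP is world size divided by TP×PP×CP, expert DP is world size divided by ETP×EP×PP); `megatron_parallel_layout` is a hypothetical helper, not a Megatron API:

```python
def megatron_parallel_layout(
    world_size: int, tp: int, pp: int, cp: int, ep: int, etp: int
) -> dict:
    """Derive dense and expert data-parallel sizes from the parallel config."""
    dense_group = tp * pp * cp
    expert_group = etp * ep * pp
    if world_size % dense_group or world_size % expert_group:
        raise ValueError("world_size must be divisible by both group sizes")
    return {
        "dp": world_size // dense_group,
        "expert_dp": world_size // expert_group,
    }


layout = megatron_parallel_layout(world_size=8, tp=4, pp=1, cp=1, ep=8, etp=1)
print(layout)
```

For TP=4, PP=1, CP=1 this gives DP=2, matching the "TP=4 × DP=2" mesh recorded in the run log, while EP=8 with ETP=1 spreads the experts across all 8 GPUs with an expert DP of 1.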
+
+
+# TIS parameters
+TIS_IMP_RATIO_CAP=2.0
+TIS_TYPE=token
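As a reminder of what a token-level truncated importance sampling (TIS) cap does, here is a minimal sketch. It assumes the common definition, where each token's loss is weighted by the trainer-to-rollout probability ratio, clipped from above at the cap. The function name is hypothetical and SkyRL's implementation may differ in detail:

```python
import math


def token_tis_weight(train_logprob: float, rollout_logprob: float, cap: float = 2.0) -> float:
    """Per-token importance weight, truncated at `cap` (TIS_IMP_RATIO_CAP)."""
    ratio = math.exp(train_logprob - rollout_logprob)
    return min(ratio, cap)


print(token_tis_weight(-1.0, -1.0))  # identical log-probs give weight 1.0
print(token_tis_weight(0.0, -2.0))   # ratio exp(2) is truncated to the cap
```

The cap bounds the influence of tokens where the Megatron policy and the vLLM rollout disagree strongly, while leaving down-weighting (ratios below 1) untouched.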
+
+OPTIMIZER_OFFLOAD_FRACTION=1.0
+OPTIMIZER_CPU_OFFLOAD=true
+
+uv run --isolated --extra megatron -m examples.train.algorithms.dapo.main_dapo \
+  data.train_data="['$TRAIN_FILE']" \
+  data.val_data="['$TEST_FILE']" \
+  trainer.algorithm.advantage_estimator="grpo" \
+  trainer.algorithm.policy_loss_type="dual_clip" \
+  trainer.algorithm.overlong_buffer_len=$OVERLONG_BUFFER_LEN \
+  trainer.algorithm.overlong_buffer_penalty_factor=$OVERLONG_BUFFER_PENALTY_FACTOR \
+  trainer.algorithm.loss_reduction=$LOSS_REDUCTION \
+  generator.inference_engine.enforce_eager=$ENFORCE_EAGER \
+  generator.apply_overlong_filtering=$APPLY_OVERLONG_FILTERING \
+  generator.sampling_params.temperature=$TEMPERATURE \
+  generator.sampling_params.top_p=$TOP_P \
+  generator.eval_sampling_params.top_p=$EVAL_TOP_P \
+  generator.eval_sampling_params.temperature=$TEMPERATURE \
+  generator.eval_sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
+  trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
+  trainer.algorithm.clip_ratio_c=$CLIP_RATIO_C \
+  trainer.policy.model.path="$MODEL_NAME" \
+  trainer.placement.colocate_all=true \
+  trainer.strategy=megatron \
+  trainer.placement.policy_num_nodes=$NUM_NODES \
+  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
+  generator.inference_engine.num_engines=$NUM_INFERENCE_ENGINES \
+  generator.inference_engine.tensor_parallel_size=$INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE \
+  trainer.policy.megatron_config.tensor_model_parallel_size=$MEGATRON_TP \
+  trainer.policy.megatron_config.pipeline_model_parallel_size=$MEGATRON_PP \
+  trainer.policy.megatron_config.context_parallel_size=$MEGATRON_CP \
+  trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
+  trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
+  trainer.policy.megatron_config.optimizer_config_kwargs.optimizer_offload_fraction=$OPTIMIZER_OFFLOAD_FRACTION \
+  trainer.policy.megatron_config.optimizer_config_kwargs.optimizer_cpu_offload=$OPTIMIZER_CPU_OFFLOAD \
+  trainer.policy.megatron_config.optimizer_config_kwargs.use_precision_aware_optimizer=$OPTIMIZER_CPU_OFFLOAD \
+  trainer.policy.megatron_config.optimizer_config_kwargs.overlap_cpu_optimizer_d2h_h2d=$OPTIMIZER_CPU_OFFLOAD \
+  trainer.algorithm.off_policy_correction.tis_ratio_type=$TIS_TYPE \
+  trainer.algorithm.off_policy_correction.token_tis_ratio_clip_high=$TIS_IMP_RATIO_CAP \
+  trainer.epochs=20 \
+  trainer.algorithm.eps_clip_low=$CLIP_RATIO_LOW \
+  trainer.algorithm.eps_clip_high=$CLIP_RATIO_HIGH \
+  trainer.eval_batch_size=1024 \
+  trainer.eval_before_train=true \
+  trainer.eval_interval=10 \
+  trainer.update_epochs_per_batch=1 \
+  trainer.train_batch_size=$TRAIN_BATCH_SIZE \
+  trainer.policy_mini_batch_size=$MINI_BATCH_SIZE \
+  trainer.micro_forward_batch_size_per_gpu=2 \
+  trainer.micro_train_batch_size_per_gpu=1 \
+  trainer.ckpt_interval=-1 \
+  trainer.max_prompt_length=$MAX_PROMPT_LENGTH \
+  generator.sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
+  trainer.policy.optimizer_config.lr=$LR \
+  trainer.policy.optimizer_config.num_warmup_steps=40 \
+  trainer.policy.optimizer_config.weight_decay=0.1 \
+  trainer.policy.optimizer_config.max_grad_norm=1.0 \
+  generator.inference_engine.backend=vllm \
+  generator.inference_engine.run_engines_locally=true \
+  generator.inference_engine.weight_sync_backend=nccl \
+  generator.inference_engine.async_engine=false \
+  generator.batched=true \
+  environment.env_class=aime \
+  generator.n_samples_per_prompt=$N_SAMPLES_PER_PROMPT \
+  generator.eval_n_samples_per_prompt=$EVAL_N_SAMPLES_PER_PROMPT \
+  generator.inference_engine.gpu_memory_utilization=0.6 \
+  generator.inference_engine.engine_init_kwargs="{moe_backend: triton, max_model_len: 12000}" \
+  trainer.logger="$LOGGER" \
+  trainer.project_name="dapo_nemotron3_nano" \
+  trainer.run_name="dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
+  trainer.export_path="$HOME/exports/dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
+  trainer.hf_save_interval=-1 \
+  trainer.resume_mode=latest \
+  trainer.max_ckpts_to_keep=3 \
+  trainer.ckpt_path="$HOME/ckpts/dapo_nemotron3_nano_30b_a3b_base_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_optim_offload_8k_max_response_length" \
+  "$@"
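The script selects `policy_loss_type="dual_clip"` with `eps_clip_low=0.2`, `eps_clip_high=0.28` and `clip_ratio_c=10.0`. The per-token objective this corresponds to can be sketched as below. This is the standard dual-clip PPO form with asymmetric clipping, written in plain Python for illustration; it is an assumption about what the trainer computes, not code from this repo:

```python
def dual_clip_token_loss(
    ratio: float,
    advantage: float,
    eps_low: float = 0.2,    # CLIP_RATIO_LOW
    eps_high: float = 0.28,  # CLIP_RATIO_HIGH
    clip_c: float = 10.0,    # CLIP_RATIO_C
) -> float:
    """Per-token dual-clip PPO loss (lower is better)."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Standard PPO pessimistic bound: take the larger of the two losses.
    ppo_loss = max(-advantage * ratio, -advantage * clipped_ratio)
    if advantage < 0:
        # Dual clip: bound the loss for negative advantages at -A * c.
        return min(ppo_loss, -advantage * clip_c)
    return ppo_loss


print(dual_clip_token_loss(1.5, 1.0))    # positive advantage, ratio clipped at 1.28
print(dual_clip_token_loss(20.0, -1.0))  # negative advantage, bounded by clip_c
```

The asymmetric range lets the ratio move further upward (1.28) than downward (0.8), and the `clip_c` bound keeps a single very off-policy token with negative advantage from dominating the update.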
diff --git a/examples/train/megatron/run_megatron_nemotron3_nano.sh b/examples/train/megatron/run_megatron_nemotron3_nano.sh
@@ -0,0 +1,87 @@
+set -x
+
+# Use the legacy (non-chunked) inference path. The new path goes through
+# vLLM's layerwise reload, which re-runs `process_weights_after_loading` and
+# (likely) re-creates view-buffer aliases that corrupt MoE/conv weights for
+# nemotron_h beyond the `conv_weights` skip we already added. Standalone
+# vLLM with HF weights at T=0.7 produces correct gsm8k answers; post-Megatron-
+# sync vLLM produces degenerate output. Legacy path uses CUDA IPC + direct
+# model.load_weights, no reload machinery.
+export _SKYRL_USE_NEW_INFERENCE=0
+
+# Colocated GRPO training+generation for Nemotron3-Nano-30B-A3B on GSM8K with Megatron.
+
+# uv run examples/train/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
+# export WANDB_API_KEY=<your_key_here>
+# bash examples/train/megatron/run_megatron_nemotron3_nano.sh
+
+DATA_DIR="$HOME/data/gsm8k"
+LOGGER="wandb"  # change to "console" to print to stdout
+MODEL_NAME="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
+
+INFERENCE_BACKEND="vllm" # currently only vllm is supported for megatron
+
+NUM_NODES=1
+NUM_GPUS=8
+
+MEGATRON_TP=4
+MEGATRON_PP=1
+MEGATRON_CP=1
+MEGATRON_EP=8
+MEGATRON_ETP=1
+
+INFERENCE_ENGINE_TP=8
+
+# # Qwen3.5 flags
+# USE_SAMPLE_PACKING=false # sample packing is not yet supported for GDN layers in megatron - see: https://github.com/NVIDIA/Megatron-LM/pull/2644
+
+uv run --isolated --extra megatron -m skyrl.train.entrypoints.main_base \
+  data.train_data="['$DATA_DIR/train.parquet']" \
+  data.val_data="['$DATA_DIR/validation.parquet']" \
+  trainer.algorithm.advantage_estimator="grpo" \
+  trainer.policy.model.path=$MODEL_NAME \
+  trainer.placement.colocate_all=true \
+  trainer.strategy=megatron \
+  trainer.placement.policy_num_nodes=$NUM_NODES \
+  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
+  trainer.placement.critic_num_gpus_per_node=$NUM_GPUS \
+  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
+  generator.inference_engine.num_engines=1 \
+  generator.inference_engine.tensor_parallel_size=$INFERENCE_ENGINE_TP \
+  trainer.policy.megatron_config.tensor_model_parallel_size=$MEGATRON_TP \
+  trainer.policy.megatron_config.pipeline_model_parallel_size=$MEGATRON_PP \
+  trainer.policy.megatron_config.context_parallel_size=$MEGATRON_CP \
+  trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
+  trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
+  trainer.use_sample_packing=true \
+  trainer.epochs=20 \
+  trainer.eval_batch_size=256 \
+  trainer.eval_before_train=false \
+  trainer.eval_interval=5 \
+  trainer.update_epochs_per_batch=1 \
+  trainer.train_batch_size=256 \
+  trainer.policy_mini_batch_size=64 \
+  trainer.micro_forward_batch_size_per_gpu=4 \
+  trainer.micro_train_batch_size_per_gpu=4 \
+  trainer.ckpt_interval=-1 \
+  trainer.max_prompt_length=512 \
+  generator.sampling_params.max_generate_length=3000 \
+  generator.sampling_params.temperature=0.7 \
+  generator.sampling_params.top_p=0.9 \
+  trainer.policy.optimizer_config.lr=1.0e-6 \
+  trainer.algorithm.use_kl_loss=true \
+  generator.inference_engine.backend=$INFERENCE_BACKEND \
+  generator.inference_engine.run_engines_locally=true \
+  generator.inference_engine.weight_sync_backend=nccl \
+  generator.inference_engine.async_engine=false \
+  generator.batched=true \
+  environment.env_class=gsm8k \
+  generator.n_samples_per_prompt=5 \
+  generator.inference_engine.gpu_memory_utilization=0.6 \
+  generator.inference_engine.engine_init_kwargs="{moe_backend: triton, max_model_len: 4096}" \
+  trainer.logger="$LOGGER" \
+  trainer.project_name="nemotron3_nano" \
+  trainer.run_name="nemotron3_nano_megatron" \
+  trainer.resume_mode=null \
+  trainer.ckpt_path="$HOME/ckpts/nemotron3_nano_megatron_ckpt" \
+  "$@"