
Release v0.2.0


@terrykong released this 24 Apr 16:38
· 473 commits to main since this release

🚀 Release v0.2.0

⚙️ Advanced Parallelism — FSDP 2, TP & SP for Efficient Training

The headline feature of v0.2 is the new DTensorPolicyWorker.

It enables advanced parallelism (FSDP 2, Tensor Parallelism, and Sequence Parallelism), letting us scale to 32B-parameter models.

Enable it via YAML or CLI overrides:

policy.dtensor_cfg.enabled=True \
policy.dtensor_cfg.tensor_parallel_size=8 \
policy.dtensor_cfg.sequence_parallel=True \
policy.dtensor_cfg.activation_checkpointing=True
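
The same settings can also live in your config YAML; a minimal sketch, assuming the file mirrors the dotted override paths above:

policy:
  dtensor_cfg:
    enabled: true
    tensor_parallel_size: 8
    sequence_parallel: true
    activation_checkpointing: true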

🧠 Learning Algorithms — DPO (Direct Preference Optimization)

Our algorithm suite now includes DPO, compatible with both FSDP 1 and DTensor.

uv run examples/run_dpo.py

More examples live in the docs.
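
Note that the default DPO config ships with DTensor disabled (see Known Issues below); with the default 1B model you will need to enable it, for example:

uv run examples/run_dpo.py \
  policy.dtensor_cfg.enabled=True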

🔄 Multi-Turn RL — Tool Use, Games & Beyond

We now support multi-turn generation and training with GRPO.

An E2E example of training to play a sliding puzzle game will be available in the next release, but you can try it by cherry-picking this PR: #242

# 8x80GB GPUs recommended
uv run python examples/run_grpo_sliding_puzzle.py 

🏋️‍♂️ Large-Model Support — Native PyTorch up to 32B @ 16k sequence length

FSDP 2 + TP + SP make RL and SFT on 32B models possible:

uv run ./examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B.yaml \
  policy.model_name='Qwen/Qwen2.5-32B' \
  policy.generation.vllm_cfg.tensor_parallel_size=4 \
  policy.max_total_sequence_length=16384 \
  cluster.num_nodes=16 \
  policy.dtensor_cfg.enabled=True \
  policy.dtensor_cfg.tensor_parallel_size=8 \
  policy.dtensor_cfg.sequence_parallel=True \
  policy.dtensor_cfg.activation_checkpointing=True

Full multi-node walkthrough in the docs.

🛡️ Environment Isolation — Per-Worker Deps with uv

In NeMo RL, workers can now launch cached, isolated uv virtual environments with their own Python dependencies, a setup we’ve found to be significantly faster than Ray’s built-in conda/pip/uv flow. Details here.
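
As a rough illustration of the effect (not NeMo RL’s internal setup code), each Ray worker ends up running inside its own interpreter and dependency set, which you can observe by printing sys.executable from a remote task:

import sys
import ray

ray.init()

@ray.remote
def which_python() -> str:
    # Reports the interpreter of the (possibly isolated) environment this worker runs in.
    return sys.executable

print(ray.get(which_python.remote()))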


🐞 Known Issues

  • FSDP 1 gradient-clipping bug — see #251
  • Qwen 32B perf tweaks coming in the next patch
  • Gemma3 convergence: #236
  • The default SFT and DPO configs in examples/configs set policy.dtensor_cfg.enabled=False (i.e. FSDP1), which is not recommended for the default 1B models with tied embeddings (#256). Set policy.dtensor_cfg.enabled=True when running these configs to resolve the error.
  • V100 configuration: #259

📊 Release Runs

We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.

You may download them here and serve them with tensorboard:

mkdir v0.2.0
tar -xzf release_runs.tar.gz -C v0.2.0/
tensorboard serve --logdir v0.2.0/

🚧 Coming soon: In future releases, we will share a TensorBoard viewer to make it easier to view and compare release runs.

What's Changed

  • fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
  • feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
  • ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
  • fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
  • fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
  • fix: Remove reference of tokenizer from generation backend (#75) by @parthchadha in #82
  • feat: unit test metric tracking by @terrykong in #40
  • fix: unit test error when coverage wasn't specified by @terrykong in #88
  • ci: temporarily disable CI on main since PRs must be up to date before merge by @terrykong in #91
  • fix: error out early if ray cluster does not have resources by @parthchadha in #89
  • ci: skip functional until more capacity available and/or tests speed up by @terrykong in #94
  • feat: evaluation implement by @yuki-666 in #16
  • fix: gradient should be averaged instead of summed across mbs by @parthchadha in #86
  • fix: Use separate step_metric for GPU Monitoring by @yfw in #92
  • feat: Update sft config to use single GPU by @ashors1 in #90
  • fix: Grammar nit by @SahilJain314 in #98
  • feat: add capability to set min/max eps separately as proposed in the… by @parthchadha in #95
  • fix: change format messages to out of place by @KiddoZhu in #77
  • fix: correct version and use setuptools.dynamic metadata for version/readme by @terrykong in #104
  • fix: remove usage of vllm to get device uuid and instead use nvidia-m… by @parthchadha in #105
  • fix: Change optional-dependencies to dependency-groups by @hemildesai in #81
  • feat: Add support for hydra style overrides by @hemildesai in #80
  • fix: Do not initialize reference model for sft by @ashors1 in #71
  • fix: change grpo default to use 64 prompts per step and 32 generation… by @parthchadha in #111
  • feat: use cuda_graph by default for vllm by @parthchadha in #116
  • fix: ensure that we check for pad_token and not assume pad_token==eos… by @parthchadha in #120
  • ci: Consolidate tests by @chtruong814 in #27
  • feat: support local venvs for dependency isolation by @terrykong in #102
  • fix: make message formatting compatible with tokenizers with no bos/eos token by @ashors1 in #118
  • fix: reset prefix cache when sleep is called to ensure prefix cache i… by @parthchadha in #112
  • ci: Fix unit test summary by @chtruong814 in #128
  • fix: fix error padding by @yuki-666 in #87
  • feat: Distributed checkpointing by @ashors1 in #99
  • ci: Add DCO placeholder check for merge queue by @chtruong814 in #147
  • ci: Clarify DCO check in merge_group by @chtruong814 in #154
  • fix: host ip resolution uses ray vs socket by @terrykong in #153
  • test: Add grpo/reinforce/ppo loss tests (prep for incoming vocab parallel changes) by @SahilJain314 in #162
  • fix: always test vllm by @parthchadha in #167
  • docs: Fix doc build warnings and add external CI config by @mckimn in #157
  • fix: allow configuring ray ports in ray.sub in case conflict on cluster by @terrykong in #173
  • feat: support arbitrary end_strings by @yuki-666 in #96
  • ci: labels for docs/L0/L1/L2 and run even if only doc test by @terrykong in #181
  • fix: don't use cuda-graphs for vllm generation by @parthchadha in #187
  • ci: Update to include public/ folder for pages deployment by @mckimn in #182
  • docs: run tests with --group test to avoid missing test deps by @terrykong in #188
  • fix: default to less verbose logging + uv-venv log once per worker by @terrykong in #141
  • docs: Correcting file names by @aschilling-nv in #161
  • fix: convert DCP to HF script works without ray cluster by @terrykong in #185
  • docs: remove backticks from uv.md title by @terrykong in #179
  • feat: add a unique seed for each vllm llm engine by @parthchadha in #171
  • fix: unit test script halts on first failure by @terrykong in #189
  • feat: Upgrade to vllm v1 runtime by @parthchadha in #170
  • ci: Run tests only in merge queue or when labeled by @chtruong814 in #159
  • fix: chat template improvements by @ashors1 in #148
  • ci: Only include dependencies in test container by @chtruong814 in #203
  • feat: Fix CPU offloading + add options for FSDP offload and activation ckpting by @yfw in #123
  • fix: prevent division by zero in ClippedPGLossFn calculation by @zpqiu in #166
  • fix: ci uses umask by @terrykong in #211
  • feat: Add FSDP2, DTensor SP/TP, activation checkpointing support by @gshennvm in #131
  • fix: grpo func test 10 step -> 3 step to speed up CI by @terrykong in #209
  • fix: fix chat_template in eval by @yuki-666 in #210
  • feat: Add total logging of generations in training by @SahilJain314 in #172
  • feat: introduce a debug API for backoff and retries for RayVirtualCluster by @terrykong in #234
  • docs: update docs everywhere to remove uv pip install which isn't reliable by @terrykong in #217
  • feat: FSDP2 SFT by @yfw in #206
  • fix: Fix missing import by @yfw in #222
  • fix: skip vllm p2p check since its flaky by @parthchadha in #238
  • ci: Remove external config from project by @mckimn in #200
  • feat: DPO by @ashors1 in #180
  • feat: Support multi-epoch training in SFT by @ashors1 in #177
  • fix: Move ray worker port range start from 20001 to 53001 by @terrykong in #235
  • fix: Speed up DPO functional test by @ashors1 in #241
  • feat: Add support for multi-turn generations and RL (tools, games, etc) by @SahilJain314 in #218
  • feat: Importance sampling trick by @yfw in #174
  • feat: streaming each dtensor in refit by @yuki-666 in #176
  • fix: Fix indent in dtensor policy by @ashors1 in #248
  • fix: raise error if tied weights model is being trained with fsdp1 or… by @parthchadha in #229
  • fix: use find_tied_parameters api from HF for tied weight keys by @parthchadha in #250
  • ci: L1 default and increase test time by @terrykong in #252
  • fix: fix broken eval script by @parthchadha in #253
  • docs: add qwen 32b instruction and add 0.3 planned features by @terrykong in #255

New Contributors

Full Changelog: v0.1.0...v0.2.0