
Release v0.2.0


@terrykong released this 24 Apr 16:38
· 473 commits to main since this release

🚀 Release v0.2.0

⚙️ Advanced Parallelism — FSDP 2, TP & SP for Efficient Training

The headline feature of v0.2 is the new DTensorPolicyWorker.

It enables advanced parallelism (FSDP 2, Tensor Parallelism, and Sequence Parallelism), letting us scale to 32B-parameter models.

Enable it via YAML or CLI overrides:

policy.dtensor_cfg.enabled=True \
policy.dtensor_cfg.tensor_parallel_size=8 \
policy.dtensor_cfg.sequence_parallel=True \
policy.dtensor_cfg.activation_checkpointing=True
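
The same settings can also live in your config YAML; a minimal sketch, assuming the file mirrors the dotted override paths above:

policy:
  dtensor_cfg:
    enabled: true
    tensor_parallel_size: 8
    sequence_parallel: true
    activation_checkpointing: true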

🧠 Learning Algorithms — DPO (Direct Preference Optimization)

Our algorithm suite now includes DPO, compatible with both FSDP 1 and DTensor.

uv run examples/run_dpo.py

More examples live in the docs.
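
Note that the default DPO config ships with DTensor disabled (see Known Issues below); with the default 1B model you will need to enable it, for example:

uv run examples/run_dpo.py \
  policy.dtensor_cfg.enabled=True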

🔄 Multi-Turn RL — Tool Use, Games & Beyond

We now support multi-turn generation and training with GRPO.

An E2E example of training to play a sliding puzzle game will be available in the next release, but you can try it by cherry-picking this PR: #242

# 8x80GB GPUs recommended
uv run python examples/run_grpo_sliding_puzzle.py 

🏋️‍♂️ Large-Model Support — Native PyTorch up to 32B @ 16k sequence length

FSDP 2 + TP + SP make RL and SFT on 32B models possible:

uv run ./examples/run_grpo_math.py \
  --config examples/configs/grpo_math_8B.yaml \
  policy.model_name='Qwen/Qwen2.5-32B' \
  policy.generation.vllm_cfg.tensor_parallel_size=4 \
  policy.max_total_sequence_length=16384 \
  cluster.num_nodes=16 \
  policy.dtensor_cfg.enabled=True \
  policy.dtensor_cfg.tensor_parallel_size=8 \
  policy.dtensor_cfg.sequence_parallel=True \
  policy.dtensor_cfg.activation_checkpointing=True

Full multi-node walkthrough in the docs.

🛡️ Environment Isolation — Per-Worker Deps with uv

In NeMo RL, workers can now launch cached, isolated uv virtual environments with their own Python dependencies, a setup we’ve found to be significantly faster than Ray’s built-in conda/pip/uv flow. Details here.
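
As a rough illustration of the effect (not NeMo RL’s internal setup code), each Ray worker ends up running inside its own interpreter and dependency set, which you can observe by printing sys.executable from a remote task:

import sys
import ray

ray.init()

@ray.remote
def which_python() -> str:
    # Reports the interpreter of the (possibly isolated) environment this worker runs in.
    return sys.executable

print(ray.get(which_python.remote()))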


🐞 Known Issues

  • FSDP 1 gradient-clipping bug — see #251
  • Qwen 32B perf tweaks coming in the next patch
  • Gemma3 convergence: #236
  • The default SFT and DPO configs in examples/configs set policy.dtensor_cfg.enabled=False (i.e. FSDP1), which is not recommended for the default 1B models with tied embeddings (#256). Set policy.dtensor_cfg.enabled=True when running these configs to resolve the error.
  • V100 configuration: #259

📊 Release Runs

We have provided TensorBoard logs for the release runs to give you a head start on what to expect from our recipes.

You may download them here and serve them with tensorboard:

mkdir v0.2.0
tar -xzf release_runs.tar.gz -C v0.2.0/
tensorboard serve --logdir v0.2.0/

🚧 Coming soon: In future releases, we will share a TensorBoard viewer to make it easier to view and compare release runs.

What's Changed

  • fix: ray.sub race condition when overlapping srun commands on same node by @terrykong in #39
  • feat: add gpu mem and util logging to wandb/tensorboard by @terrykong in #37
  • ci: tests now run with HF_DATASETS_CACHE to speed up e2e time by @terrykong in #41
  • fix: update the instructions for multi-node setup; change the title f… by @parthchadha in #78
  • fix: Mixed Prec memory improvements and better default configs (converge-able) by @SahilJain314 in #32
  • fix: Remove reference of tokenizer from generation backend (#75) by @parthchadha in #82
  • feat: unit test metric tracking by @terrykong in #40
  • fix: unit test error when coverage wasn't specified by @terrykong in #88
  • ci: temporarily disable CI on main since PRs must be up to date before merge by @terrykong in #91
  • fix: error out early if ray cluster does not have resources by @parthchadha in #89
  • ci: skip functional until more capacity available and/or tests speed up by @terrykong in #94
  • feat: evaluation implement by @yuki-666 in #16
  • fix: gradient should be averaged instead of summed across mbs by @parthchadha in #86
  • fix: Use separate step_metric for GPU Monitoring by @yfw in #92
  • feat: Update sft config to use single GPU by @ashors1 in #90
  • fix: Grammar nit by @SahilJain314 in #98
  • feat: add capability to set min/max eps separately as proposed in the… by @parthchadha in #95
  • fix: change format messages to out of place by @KiddoZhu in #77
  • fix: correct version and use setuptools.dynamic metadata for version/readme by @terrykong in #104
  • fix: remove usage of vllm to get device uuid and instead use nvidia-m… by @parthchadha in #105
  • fix: Change optional-dependencies to dependency-groups by @hemildesai in #81
  • feat: Add support for hydra style overrides by @hemildesai in #80
  • fix: Do not initialize reference model for sft by @ashors1 in #71
  • fix: change grpo default to use 64 prompts per step and 32 generation… by @parthchadha in #111
  • feat: use cuda_graph by default for vllm by @parthchadha in #116
  • fix: ensure that we check for pad_token and not assume pad_token==eos… by @parthchadha in #120
  • ci: Consolidate tests by @chtruong814 in #27
  • feat: support local venvs for dependency isolation by @terrykong in #102
  • fix: make message formatting compatible with tokenizers with no bos/eos token by @ashors1 in #118
  • fix: reset prefix cache when sleep is called to ensure prefix cache i… by @parthchadha in #112
  • ci: Fix unit test summary by @chtruong814 in #128
  • fix: fix error padding by @yuki-666 in #87
  • feat: Distributed checkpointing by @ashors1 in #99
  • ci: Add DCO placeholder check for merge queue by @chtruong814 in #147
  • ci: Clarify DCO check in merge_group by @chtruong814 in #154
  • fix: host ip resolution uses ray vs socket by @terrykong in #153
  • test: Add grpo/reinforce/ppo loss tests (prep for incoming vocab parallel changes) by @SahilJain314 in #162
  • fix: always test vllm by @parthchadha in #167
  • docs: Fix doc build warnings and add external CI config by @mckimn in #157
  • fix: allow configuring ray ports in ray.sub in case conflict on cluster by @terrykong in #173
  • feat: support arbitrary end_strings by @yuki-666 in #96
  • ci: labels for docs/L0/L1/L2 and run even if only doc test by @terrykong in #181
  • fix: don't use cuda-graphs for vllm generation by @parthchadha in #187
  • ci: Update to include public/ folder for pages deployment by @mckimn in #182
  • docs: run tests with --group test to avoid missing test deps by @terrykong in #188
  • fix: default to less verbose logging + uv-venv log once per worker by @terrykong in #141
  • docs: Correcting file names by @aschilling-nv in #161
  • fix: convert DCP to HF script works without ray cluster by @terrykong in #185
  • docs: remove backticks from uv.md title by @terrykong in #179
  • feat: add a unique seed for each vllm llm engine by @parthchadha in #171
  • fix: unit test script halts on first failure by @terrykong in #189
  • feat: Upgrade to vllm v1 runtime by @parthchadha in #170
  • ci: Run tests only in merge queue or when labeled by @chtruong814 in #159
  • fix: chat template improvements by @ashors1 in #148
  • ci: Only include dependencies in test container by @chtruong814 in #203
  • feat: Fix CPU offloading + add options for FSDP offload and activation ckpting by @yfw in #123
  • fix: prevent division by zero in ClippedPGLossFn calculation by @zpqiu in #166
  • fix: ci uses umask by @terrykong in #211
  • feat: Add FSDP2, DTensor SP/TP, activation checkpointing support by @gshennvm in #131
  • fix: grpo func test 10 step -> 3 step to speed up CI by @terrykong in #209
  • fix: fix chat_template in eval by @yuki-666 in #210
  • feat: Add total logging of generations in training by @SahilJain314 in #172
  • feat: introduce a debug API for backoff and retries for RayVirtualCluster by @terrykong in #234
  • docs: update docs everywhere to remove uv pip install which isn't reliable by @terrykong in #217
  • feat: FSDP2 SFT by @yfw in #206
  • fix: Fix missing import by @yfw in #222
  • fix: skip vllm p2p check since its flaky by @parthchadha in #238
  • ci: Remove external config from project by @mckimn in #200
  • feat: DPO by @ashors1 in #180
  • feat: Support multi-epoch training in SFT by @ashors1 in #177
  • fix: Move ray worker port range start from 20001 to 53001 by @terrykong in #235
  • fix: Speed up DPO functional test by @ashors1 in #241
  • feat: Add support for multi-turn generations and RL (tools, games, etc) by @SahilJain314 in #218
  • feat: Importance sampling trick by @yfw in #174
  • feat: streaming each dtensor in refit by @yuki-666 in #176
  • fix: Fix indent in dtensor policy by @ashors1 in #248
  • fix: raise error if tied weights model is being trained with fsdp1 or… by @parthchadha in #229
  • fix: use find_tied_parameters api from HF for tied weight keys by @parthchadha in #250
  • ci: L1 default and increase test time by @terrykong in #252
  • fix: fix broken eval script by @parthchadha in #253
  • docs: add qwen 32b instruction and add 0.3 planned features by @terrykong in #255

New Contributors

Full Changelog: v0.1.0...v0.2.0