document on policy training #613

cmunley1 · 2026-01-28T01:46:48Z

Add docs for how Gym and RL enforce monotonicity and perform on-policy token ID corrections. Add hypothetical docs on how to disable these checks for non-monotonic trajectories, e.g. Qwen3 thinking or agents with context management.

Disabling would be done by NVIDIA-NeMo/RL#1812, potentially in NVIDIA-NeMo/RL#1779.
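
For reviewers unfamiliar with the term, here is a minimal sketch of what a monotonicity check could look like. It assumes "monotonic" means that each turn's token IDs begin with the previous turn's full token IDs as a prefix. The names `is_prefix_monotonic`, `check_trajectory`, and the `enforce` parameter are hypothetical and for illustration only; they are not the Gym or RL API, and the actual disable mechanism is whatever the PRs referenced above introduce. On-policy token ID correction is not covered here.

```python
from typing import Sequence


def is_prefix_monotonic(turns: Sequence[Sequence[int]]) -> bool:
    """Return True if every turn's token IDs start with the previous turn's IDs."""
    for prev, curr in zip(turns, turns[1:]):
        # A turn that is shorter than, or diverges from, its predecessor
        # means earlier tokens were dropped or rewritten.
        if len(curr) < len(prev) or list(curr[: len(prev)]) != list(prev):
            return False
    return True


def check_trajectory(turns: Sequence[Sequence[int]], enforce: bool = True) -> bool:
    """Hypothetical wrapper: raise on non-monotonic input unless enforce is False."""
    ok = is_prefix_monotonic(turns)
    if enforce and not ok:
        raise ValueError("trajectory is not prefix-monotonic")
    return ok
```

With this sketch, a trajectory whose history is trimmed between turns (for example, thinking tokens removed or context compacted) fails the check when `enforce=True` and is merely reported as non-monotonic when `enforce=False`.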

Signed-off-by: Christian Munley <cmunley@nvidia.com>

cmunley1 added 3 commits January 27, 2026 17:45

document on policy training

8bfd0cb

Signed-off-by: Christian Munley <cmunley@nvidia.com>

small fix

11bc5ea

Signed-off-by: Christian Munley <cmunley@nvidia.com>

move location

9ad3812

Signed-off-by: Christian Munley <cmunley@nvidia.com>

2 participants
