# True On-Policy between Training and Inference

True on-policy guarantees that the log probabilities produced by the inference engine (SGLang) are bitwise equal to those produced by the training engine.

## Examples

### Example 1

This script provides a minimal example of enabling true on-policy:

```bash
python examples/true_on_policy/run_simple.py
```

### Example 2

This script supports more features for various use cases; one of its flags enables true on-policy:

```bash
python scripts/run_qwen3_4b_fsdp.py --true-on-policy
```

To see the curve quickly, you may use `--mode debug_minimal`, which skips evaluation and runs generation with a very short output sequence length (OSL). Since true on-policy is unrelated to OSL or answer correctness, this mode is suitable for quick experiments.

### Other Cases

To support true on-policy in other cases, refer to the flags changed in the examples above.

### What to Expect

After training, the wandb metric `train/train_rollout_logprob_abs_diff` should be exactly `0`, indicating that there is no difference between the log probabilities from training and inference. Without the feature enabled, this value is typically nonzero.
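
Conceptually, the metric measures the per-token gap between the two sets of log probabilities. The sketch below is illustrative only (the helper name and exact reduction are assumptions, not slime's actual implementation):

```python
# Hypothetical sketch of a logprob-difference metric like
# train/train_rollout_logprob_abs_diff. The function name and the use of
# a max reduction are illustrative assumptions, not slime's real code.

def logprob_abs_diff(train_logprobs, rollout_logprobs):
    """Largest absolute per-token difference between two logprob sequences."""
    return max(abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs))

# With true on-policy, both engines emit bitwise-identical values:
train = [-0.25, -1.5, -0.125]
rollout = [-0.25, -1.5, -0.125]
print(logprob_abs_diff(train, rollout))  # -> 0.0
```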

## How it is Implemented

The core idea is to make every operation in training and inference bitwise equal. The main code is implemented in [#566](https://github.com/THUDM/slime/pull/566) and [SGLang#12058](https://github.com/sgl-project/sglang/pull/12058).

Briefly, we aligned the following components:
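
To see why bitwise equality is nontrivial, note that floating-point arithmetic is not associative, so two kernels that reduce the same values in different orders produce different bits. A minimal standalone illustration (not slime code):

```python
# Floating-point addition is not associative: different reduction orders
# in training vs. inference kernels break bitwise equality even though
# both results are numerically "correct".
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one reduction order
right = a + (b + c)  # another reduction order
print(left == right)  # -> False
print(left, right)
```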

* Attention: We use the FA3 backend for both training and inference, since it produces bitwise-equal results between prefill and decode operations.
* GEMM: We use DeepGEMM for fast matrix multiplication while preserving true on-policy, thanks to its deterministic selection of details such as tensor core instructions ([SGLang#12142](https://github.com/sgl-project/sglang/pull/12142)).
* Other kernels: Beyond using batch-invariant kernels as a prerequisite, we align numeric details between the two systems, such as operation dtypes and the specific kernels used. Some operations can also be compiled for speedup ([#603](https://github.com/THUDM/slime/pull/603), [SGLang#12161](https://github.com/sgl-project/sglang/pull/12161)).
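
The dtype point can be illustrated without any framework: running an op in a narrower precision than its counterpart changes the resulting bits. The sketch below mimics a float32 kernel output by round-tripping a Python float (float64) through a packed 32-bit representation; it is an illustration, not slime code:

```python
import struct

def to_float32(x: float) -> float:
    # Round-trip through IEEE-754 binary32 to mimic a kernel that
    # computes or stores its result in float32 instead of float64.
    return struct.unpack('f', struct.pack('f', x))[0]

x = 1.0 / 3.0        # float64 result
x32 = to_float32(x)  # the "same" value at float32 precision
print(x == x32)  # -> False: the two systems would already disagree here
```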

To align the two parts more easily, we use SGLang's [dumper](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/debug_utils/dumper.py) tool for quick comparisons ([#12622](https://github.com/sgl-project/sglang/pull/12622) and [#12623](https://github.com/sgl-project/sglang/pull/12623) make this most convenient).